1 Executive Summary

In this project, we examine Airbnb listing data for San Francisco and try to derive a model that can accurately predict the price of a stay, given certain criteria. Our best model, model8, accounts for a variety of factors that could plausibly be relevant to our dependent variable, log(price_4_nights), which is the log transformation of the price for 2 people to stay in a San Francisco Airbnb for 4 nights. The variables in the model include the type of property, the number of reviews, the review score (rating), the type of room, the number of bedrooms, how many people can be accommodated, and roughly which neighbourhood the listing is in. We ran a total of 8 models with different combinations of the available variables and arrived at model8, which has an adjusted R-squared of 0.60, meaning that the independent variables explain about 60% of the variation in the dependent variable after adjusting for the number of predictors. From this model, we further arrive at a prediction of how much it costs for 2 people to stay in an Airbnb in downtown San Francisco for four nights, given a few extra criteria.

2 Exploratory Data Analysis (EDA)

Variables in the dataframe:

  • price: cost per night
  • property_type: type of accommodation (House, Apartment, etc.)
  • room_type:
    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing
  • review_scores_rating: Average review score (0 - 5 in this dataset)
  • longitude, latitude: geographical coordinates to help us locate the listing
  • neighbourhood*: three variables on a few major neighbourhoods in each city

We start off by answering a few questions about the data, based on the data wrangling done below:

  • How many variables/columns? How many rows/observations? The data has 74 variables and 6,566 observations.
  • Which variables are numbers? 37 variables are numbers (the columns of type dbl in the glimpse output below). In this answer we assume that dates are not numbers.
  • Which are categorical or factor variables (numeric or character variables with a fixed and known set of possible values)? 22 categorical variables.
  • What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable? These questions are taken up in Section 2.5.
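The column-type counts above can be reproduced with a short tally. The sketch below uses a small made-up data frame (the object toy and its values are hypothetical, not the actual listings data) and base R only; on the real data the same calls would be applied to listings.

```r
# Toy stand-in for the listings data frame (hypothetical values)
toy <- data.frame(
  id = c(1, 2, 3),
  name = c("A", "B", "C"),
  host_is_superhost = c(TRUE, FALSE, FALSE),
  last_scraped = as.Date(c("2021-10-06", "2021-10-06", "2021-10-06")),
  price = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# First class of each column, then a frequency table of those classes
col_types <- vapply(toy, function(col) class(col)[1], character(1))
type_counts <- table(col_types)

print(dim(toy))       # number of rows, number of columns
print(type_counts)    # how many columns of each type
```

Dates are tallied separately from numeric columns here, which matches the assumption that dates are not numbers.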

2.1 Data wrangling

#data type
glimpse(listings)
Rows: 6,566
Columns: 74
$ id                                           <dbl> 958, 5858, 7918, 8142, 83…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.021101e+13, 2.021101e+1…
$ last_scraped                                 <date> 2021-10-06, 2021-10-06, …
$ name                                         <chr> "Bright, Modern Garden Un…
$ description                                  <chr> "Please check local laws …
$ neighborhood_overview                        <chr> "Quiet cul de sac in frie…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 1169, 8904, 21994, 21994,…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Holly", "Philip And Tani…
$ host_since                                   <date> 2008-07-31, 2009-03-02, …
$ host_location                                <chr> "San Francisco, Californi…
$ host_about                                   <chr> "We are a family of four …
$ host_response_time                           <chr> "within an hour", "N/A", …
$ host_response_rate                           <chr> "100%", "N/A", "100%", "1…
$ host_acceptance_rate                         <chr> "92%", "N/A", "100%", "10…
$ host_is_superhost                            <lgl> TRUE, FALSE, FALSE, FALSE…
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Duboce Triangle", "Berna…
$ host_listings_count                          <dbl> 1, 2, 10, 10, 2, 2, 1, 0,…
$ host_total_listings_count                    <dbl> 1, 2, 10, 10, 2, 2, 1, 0,…
$ host_verifications                           <chr> "['email', 'phone', 'face…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ neighbourhood                                <chr> "San Francisco, Californi…
$ neighbourhood_cleansed                       <chr> "Western Addition", "Bern…
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude                                     <dbl> 37.77028, 37.74474, 37.76…
$ longitude                                    <dbl> -122.4332, -122.4209, -12…
$ property_type                                <chr> "Entire serviced apartmen…
$ room_type                                    <chr> "Entire home/apt", "Entir…
$ accommodates                                 <dbl> 3, 5, 2, 2, 4, 3, 4, 2, 3…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1 bath", "1 bath", "4 sh…
$ bedrooms                                     <dbl> 1, 2, 1, 1, 2, 1, 2, NA, …
$ beds                                         <dbl> 2, 3, 1, 1, 2, 1, 3, 1, 3…
$ amenities                                    <chr> "[\"Keypad\", \"Refrigera…
$ price                                        <chr> "$160.00", "$235.00", "$5…
$ minimum_nights                               <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ maximum_nights                               <dbl> 30, 60, 60, 90, 111, 14, …
$ minimum_minimum_nights                       <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ maximum_minimum_nights                       <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ minimum_maximum_nights                       <dbl> 1125, 60, 60, 90, 111, 14…
$ maximum_maximum_nights                       <dbl> 1125, 60, 60, 90, 111, 14…
$ minimum_nights_avg_ntm                       <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ maximum_nights_avg_ntm                       <dbl> 1125, 60, 60, 90, 111, 14…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 6, 30, 30, 11, 30, 23, 4,…
$ availability_60                              <dbl> 12, 60, 60, 41, 60, 47, 2…
$ availability_90                              <dbl> 18, 90, 90, 71, 90, 77, 5…
$ availability_365                             <dbl> 104, 365, 365, 346, 365, …
$ calendar_last_scraped                        <date> 2021-10-06, 2021-10-06, …
$ number_of_reviews                            <dbl> 302, 111, 19, 8, 28, 736,…
$ number_of_reviews_ltm                        <dbl> 40, 0, 0, 0, 0, 1, 2, 0, …
$ number_of_reviews_l30d                       <dbl> 5, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review                                 <date> 2014-10-05, 2009-11-24, …
$ last_review                                  <date> 2021-09-17, 2015-08-28, …
$ review_scores_rating                         <dbl> 4.87, 4.88, 4.20, 4.63, 4…
$ review_scores_accuracy                       <dbl> 4.94, 4.85, 3.73, 4.38, 4…
$ review_scores_cleanliness                    <dbl> 4.95, 4.87, 3.87, 4.38, 5…
$ review_scores_checkin                        <dbl> 4.96, 4.89, 4.67, 4.75, 4…
$ review_scores_communication                  <dbl> 4.90, 4.85, 4.60, 4.75, 5…
$ review_scores_location                       <dbl> 4.98, 4.77, 4.73, 4.63, 4…
$ review_scores_value                          <dbl> 4.78, 4.68, 4.00, 4.63, 4…
$ license                                      <chr> "City Registration Pendin…
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FALS…
$ calculated_host_listings_count               <dbl> 1, 1, 9, 9, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes  <dbl> 1, 1, 0, 0, 2, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 9, 9, 0, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 3.54, 0.77, 0.17, 0.10, 0…
#drops any non-numeric characters in price
listings <- listings %>% 
  mutate(price = parse_number(price)) 

Use typeof(listings$price) to confirm that price is now stored as a number.

#check price is a number
typeof(listings$price)
[1] "double"
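
To make the cleaning step concrete, here is a minimal base R sketch of roughly what parse_number does for dollar-formatted strings. It is an illustration on example strings (the third value is made up to show a thousands separator), not a replacement for the readr function.

```r
# Example price strings; the third value is hypothetical
raw_price <- c("$160.00", "$235.00", "$1,250.50")

# Remove dollar signs and commas, then convert to numeric
clean_price <- as.numeric(gsub("[$,]", "", raw_price))

print(clean_price)
print(typeof(clean_price))
```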

We skim for missing data and check some summary statistics for the price and accommodates variables.

#filter for missing / skim data 
favstats(~price,data=listings) #favstats for price
 min Q1 median  Q3     max mean  sd    n missing
   0 95    150 240 2.5e+04  235 687 6566       0
favstats(~accommodates,data=listings) #favstats for accommodates (# of people)
 min Q1 median Q3 max mean   sd    n missing
   0  2      2  4  16 3.09 1.83 6566       0
listings%>%
  skim() %>%
  filter(n_missing > 0)
Data summary
Name Piped data
Number of rows 6566
Number of columns 74
_______________________
Column type frequency:
character 14
Date 3
logical 6
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
description 75 0.99 14 1000 0 5850 0
neighborhood_overview 1777 0.73 9 1000 0 3642 0
host_name 14 1.00 1 42 0 1850 0
host_location 20 1.00 2 62 0 236 0
host_about 1932 0.71 1 3409 0 2356 3
host_response_time 14 1.00 3 18 0 5 0
host_response_rate 14 1.00 2 4 0 48 0
host_acceptance_rate 14 1.00 2 4 0 92 0
host_thumbnail_url 14 1.00 55 106 0 3393 0
host_picture_url 14 1.00 57 109 0 3393 0
host_neighbourhood 418 0.94 3 31 0 162 0
neighbourhood 1777 0.73 28 54 0 6 0
bathrooms_text 10 1.00 6 17 0 30 0
license 2735 0.58 3 426 0 1648 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 14 1.00 2008-07-31 2021-09-28 2015-02-02 2127
first_review 1397 0.79 2009-09-25 2021-10-04 2018-12-26 2181
last_review 1397 0.79 2010-10-04 2021-10-05 2021-07-10 1090

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 14 1 0.44 FAL: 3670, TRU: 2882
host_has_profile_pic 14 1 0.99 TRU: 6490, FAL: 62
host_identity_verified 14 1 0.85 TRU: 5592, FAL: 960
neighbourhood_group_cleansed 6566 0 NaN :
bathrooms 6566 0 NaN :
calendar_updated 6566 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
host_listings_count 14 1.00 72.64 318.97 0.00 1.00 2.00 12.0 1987 ▇▁▁▁▁
host_total_listings_count 14 1.00 72.64 318.97 0.00 1.00 2.00 12.0 1987 ▇▁▁▁▁
bedrooms 933 0.86 1.51 0.86 1.00 1.00 1.00 2.0 9 ▇▁▁▁▁
beds 66 0.99 1.72 1.22 0.00 1.00 1.00 2.0 14 ▇▂▁▁▁
minimum_minimum_nights 2 1.00 24.10 55.20 1.00 2.00 30.00 30.0 1125 ▇▁▁▁▁
maximum_minimum_nights 2 1.00 39.62 116.87 1.00 2.00 30.00 30.0 1125 ▇▁▁▁▁
minimum_maximum_nights 2 1.00 687.45 548.24 1.00 70.00 1125.00 1125.0 10000 ▇▁▁▁▁
maximum_maximum_nights 2 1.00 7525390.19 126905437.12 1.00 90.00 1125.00 1125.0 2147483647 ▇▁▁▁▁
minimum_nights_avg_ntm 2 1.00 38.96 113.93 1.00 2.00 30.00 30.0 1125 ▇▁▁▁▁
maximum_nights_avg_ntm 2 1.00 7508363.68 126618320.95 1.00 90.00 1125.00 1125.0 2142625089 ▇▁▁▁▁
review_scores_rating 1397 0.79 4.73 0.56 0.00 4.71 4.89 5.0 5 ▁▁▁▁▇
review_scores_accuracy 1429 0.78 4.82 0.40 0.00 4.80 4.94 5.0 5 ▁▁▁▁▇
review_scores_cleanliness 1429 0.78 4.76 0.43 0.00 4.71 4.91 5.0 5 ▁▁▁▁▇
review_scores_checkin 1430 0.78 4.88 0.32 0.00 4.89 4.98 5.0 5 ▁▁▁▁▇
review_scores_communication 1429 0.78 4.86 0.37 1.00 4.88 4.98 5.0 5 ▁▁▁▁▇
review_scores_location 1430 0.78 4.80 0.39 0.00 4.77 4.91 5.0 5 ▁▁▁▁▇
review_scores_value 1430 0.78 4.66 0.45 0.00 4.58 4.76 4.9 5 ▁▁▁▁▇
reviews_per_month 1397 0.79 1.94 5.20 0.01 0.22 0.69 2.1 126 ▇▁▁▁▁

From the price statistics, we see that the distribution is strongly right-skewed: the mean (235) is well above the median (150) and almost equal to Q3 (240). The accommodates variable is also right-skewed (mean 3.09 versus median 2), with the typical Airbnb accommodating 2 people.
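
The mean-versus-median comparison used above can be sketched in a few lines of base R. The vector below is a hypothetical right-skewed sample, not the real price data.

```r
# Toy right-skewed sample (hypothetical prices)
x <- c(50, 95, 120, 150, 180, 240, 2500)

stats <- c(mean = mean(x),
           median = median(x),
           q3 = unname(quantile(x, 0.75)))
print(stats)

# A mean well above the median is a quick sign of right skew
right_skewed <- stats[["mean"]] > stats[["median"]]
print(right_skewed)
```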

2.2 Property types

Next, we look at the variable property_type. We can use the count function to determine how many categories there are and how frequent each one is. What are the top 4 most common property types? What proportion of the total listings do they make up?

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. We create the variable prop_type_simplified.

number_listings <- listings %>%
  group_by(property_type) %>%
  count(sort=TRUE) %>%
  kable(format = "html") %>%
  kable_classic()
number_listings
property_type n
Entire rental unit 1892
Private room in residential home 806
Entire residential home 756
Entire condominium (condo) 649
Private room in rental unit 515
Entire guest suite 445
Room in boutique hotel 394
Room in hotel 173
Private room in condominium (condo) 154
Entire serviced apartment 149
Entire loft 75
Private room in guest suite 63
Room in aparthotel 56
Shared room in hostel 52
Entire guesthouse 47
Entire townhouse 47
Private room in townhouse 37
Private room in hostel 35
Shared room in residential home 33
Private room in bed and breakfast 29
Shared room in rental unit 25
Shared room in bed and breakfast 14
Entire cottage 13
Private room in resort 13
Entire bungalow 10
Entire villa 10
Private room in serviced apartment 8
Entire resort 7
Private room in loft 5
Private room 4
Private room in casa particular 4
Private room in guesthouse 4
Private room in villa 4
Room in hostel 4
Shared room in villa 4
Private room in cottage 3
Room in bed and breakfast 3
Room in serviced apartment 3
Tiny house 3
Castle 2
Entire cabin 2
Entire place 2
Private room in bungalow 2
Shared room in condominium (condo) 2
Barn 1
Casa particular 1
Cycladic house 1
Entire in-law 1
Floor 1
Private room in farm stay 1
Private room in treehouse 1
Shared room in loft 1
#sum(number_listings$n)

The top 4 most common property types are Entire rental unit, Private room in residential home, Entire residential home, Entire condominium (condo). They make up 4103/6566 = 0.625 of the total listings.
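
The share quoted above can be re-derived directly from the counts in the table; a minimal check in base R:

```r
# Counts copied from the property_type table above
top4 <- c(
  "Entire rental unit" = 1892,
  "Private room in residential home" = 806,
  "Entire residential home" = 756,
  "Entire condominium (condo)" = 649
)
total_listings <- 6566

top4_total <- sum(top4)
top4_share <- top4_total / total_listings

print(top4_total)
print(round(top4_share, 3))
```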

listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in residential home", "Entire residential home","Entire condominium (condo)") ~ property_type, 
    TRUE ~ "Other"
  ))

Checking that prop_type_simplified was correctly made.

listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))        

property_type | prop_type_simplified | n
Entire rental unit | Entire rental unit | 1892
Private room in residential home | Private room in residential home | 806
Entire residential home | Entire residential home | 756
Entire condominium (condo) | Entire condominium (condo) | 649
Private room in rental unit | Other | 515
Entire guest suite | Other | 445
Room in boutique hotel | Other | 394
Room in hotel | Other | 173
Private room in condominium (condo) | Other | 154
Entire serviced apartment | Other | 149
Entire loft | Other | 75
Private room in guest suite | Other | 63
Room in aparthotel | Other | 56
Shared room in hostel | Other | 52
Entire guesthouse | Other | 47
Entire townhouse | Other | 47
Private room in townhouse | Other | 37
Private room in hostel | Other | 35
Shared room in residential home | Other | 33
Private room in bed and breakfast | Other | 29
Shared room in rental unit | Other | 25
Shared room in bed and breakfast | Other | 14
Entire cottage | Other | 13
Private room in resort | Other | 13
Entire bungalow | Other | 10
Entire villa | Other | 10
Private room in serviced apartment | Other | 8
Entire resort | Other | 7
Private room in loft | Other | 5
Private room | Other | 4
Private room in casa particular | Other | 4
Private room in guesthouse | Other | 4
Private room in villa | Other | 4
Room in hostel | Other | 4
Shared room in villa | Other | 4
Private room in cottage | Other | 3
Room in bed and breakfast | Other | 3
Room in serviced apartment | Other | 3
Tiny house | Other | 3
Castle | Other | 2
Entire cabin | Other | 2
Entire place | Other | 2
Private room in bungalow | Other | 2
Shared room in condominium (condo) | Other | 2
Barn | Other | 1
Casa particular | Other | 1
Cycladic house | Other | 1
Entire in-law | Other | 1
Floor | Other | 1
Private room in farm stay | Other | 1
Private room in treehouse | Other | 1
Shared room in loft | Other | 1

Now we have 5 distinct property types under prop_type_simplified.
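
The four category names are hard-coded in the case_when call above. As a possible alternative, the top categories can be derived from the data itself. The sketch below does this in base R on a toy vector; the values and the choice of top_n = 2 are hypothetical and only keep the example small.

```r
# Toy property types (hypothetical counts)
property_type <- c(rep("Entire rental unit", 5),
                   rep("Entire residential home", 3),
                   "Entire loft",
                   "Castle")

# Names of the top_n most frequent categories
top_n <- 2
top_types <- names(sort(table(property_type), decreasing = TRUE))[seq_len(top_n)]

# Keep the top categories, lump everything else into "Other"
prop_type_simplified <- ifelse(property_type %in% top_types, property_type, "Other")
print(table(prop_type_simplified))
```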

2.3 Number of nights

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

#most common value for minimum_nights
listings %>%
  group_by(minimum_nights) %>%
  count(sort=TRUE) %>%
  kable(format = "html") %>%
  kable_classic()
minimum_nights n
30 3012
1 1095
2 999
3 610
4 164
5 123
7 109
31 92
365 82
60 48
90 47
32 30
6 28
14 22
180 14
10 8
28 8
15 7
45 7
280 7
21 4
50 4
120 4
12 3
183 3
360 3
500 3
8 2
13 2
25 2
29 2
40 2
190 2
18 1
35 1
44 1
59 1
61 1
62 1
75 1
80 1
100 1
140 1
160 1
192 1
200 1
300 1
359 1
366 1
1000 1
1125 1

  • The most common values for the variable minimum_nights are 30, 1, 2, 3, 4, and 5.
  • Among these common values, 30 stands out as much larger than the others; it is also the single most frequent value (3,012 listings).
  • The likely intended purpose of listings with this seemingly unusual minimum_nights value is to induce guests to stay longer, which reduces moving, cleaning, and marketing costs.

We filter the Airbnb data so that it only includes observations with minimum_nights <= 4.

#filter for listings with a minimum stay of at most 4 nights
listings_1 <- listings %>%
  filter(minimum_nights <= 4)
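
As a quick sanity check on the size of listings_1, the number of listings that survive this filter can be worked out from the minimum_nights table above. This is a sketch on the tabulated counts, not a re-run of the filter.

```r
# Counts copied from the minimum_nights table above (values 1 to 4)
kept <- c("1" = 1095, "2" = 999, "3" = 610, "4" = 164)
total_listings <- 6566

n_kept <- sum(kept)
share_kept <- n_kept / total_listings

print(n_kept)
print(round(share_kept, 3))
```

Under these counts, fewer than half of all listings remain, mostly because the 3,012 listings with a 30-night minimum are dropped.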

2.4 Visualizations

#histogram (price)
listings%>%
  ggplot(aes(x=price))+
  geom_histogram(binwidth=5)+   #binwidth belongs in geom_histogram(), not ggplot()
  theme_minimal()+
  labs(title = "Price vs. Count")

#plot with price less than or equal to 1000
listings %>%
  filter(price<=1000) %>%
  ggplot(aes(x=price)) +
  geom_histogram(binwidth=10)+   #binwidth belongs in geom_histogram(), not ggplot()
  theme_minimal()+
  labs(title="Price Under 1000 vs. Count",x="price under 1000")

Most listings in San Francisco lie between $50 and $1,000. There are a few outliers far above that range, with the maximum at $25,000 according to the favstats output above. However, the histogram restricted to listings at or below $1,000 shows that the majority of listings are below $250 per night.

#plot for property_type vs average price 
listings %>%
  group_by(prop_type_simplified)%>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x=prop_type_simplified,y=avg_price))+
  geom_col()+
  labs(title = "Property Type vs. Average Price",x="property type",y="average price")

The bar graph displays the average price by simplified property type, which helps us see which kinds of properties command higher prices overall. Entire residential homes have the highest average price, plausibly because they offer more amenities and space, while a private room within a residential home has the lowest average price, plausibly because of lower privacy and luxury.

2.5 Correlations between variables

#correlation between each variable (done after price is a number (not string))
listings %>%
  select(price,accommodates,number_of_reviews,
         bedrooms,beds,review_scores_rating,review_scores_cleanliness,
         review_scores_location,review_scores_value,reviews_per_month) %>%
  ggpairs(lower = list(continuous = wrap("points", alpha = 0.3)))+  #alpha is passed to the scatterplots via wrap()
  labs(title = "Correlation Between Each Variable")+
  theme_bw()

The variables accommodates, bedrooms, and beds exhibit strong positive correlations with one another. The different review scores are also highly correlated with each other, which may result in collinearity issues if we put them in the same model. Most of the listed variables appear to be correlated with price.

#correlation (ggpairs with the filter for less than or equal to 4 nights)
listings_1 %>%
  select(price,accommodates,number_of_reviews,bedrooms,beds,review_scores_rating,review_scores_cleanliness,review_scores_location,review_scores_value,reviews_per_month) %>%
  ggpairs(lower = list(continuous = wrap("points", alpha = 0.3)))+  #alpha is passed to the scatterplots via wrap()
  labs(title = "Correlation Between Each Variable (<= 4 nights)")+
  theme_bw()

Understanding the collinearity between the explanatory variables is important for the models that follow, and the plots above help with this. Here we use listings_1, the dataset that keeps only listings with minimum_nights <= 4 (that is, it filters out listings requiring more than 4 nights), as we are looking at predicting a 4-night stay in San Francisco. The results are quite similar to the previous ggpairs plot, and most of the listed variables are again correlated with price.
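
A numeric correlation matrix is a compact complement to the pairs plots. The sketch below uses base R's cor on a simulated toy data frame (all values are hypothetical, and the 0.7 threshold is our assumption, not something taken from the analysis). With the real data, the same call would be applied to the selected numeric columns.

```r
# Toy data (hypothetical): beds and accommodates move together by construction
set.seed(1)
accommodates <- sample(1:8, 200, replace = TRUE)
beds <- pmax(1, round(accommodates / 2 + rnorm(200, sd = 0.5)))
rating <- round(runif(200, 3, 5), 2)
rating[sample(200, 20)] <- NA   # mimic missing review scores
toy <- data.frame(accommodates, beds, rating)

# Pairwise-complete correlations tolerate the missing ratings
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Flag pairs whose absolute correlation exceeds the chosen threshold
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
print(high)
```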

3 Mapping

Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. R supports many geospatial visualisations; here we start with a map of the city and overlay all Airbnb coordinates to get an overview of the spatial distribution of Airbnb rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in and out, click on a point to get the actual Airbnb listing for that specific point, etc.

Using the listings dataframe with all Airbnb listings in San Francisco, the following code plots on the map all Airbnbs where minimum_nights is less than or equal to four (4). You can learn more about leaflet by following the relevant Datacamp course on mapping with leaflet.

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

4 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four nights.

We shall first create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

#creating price_4_nights variable

listings_filtered <- listings %>%
  filter(accommodates>=2, minimum_nights <= 4, maximum_nights >= 4) %>%
  mutate(price_4_nights = 4*price)

Then, we shall use histograms and density plots to examine the distributions of price_4_nights and log(price_4_nights). In later analysis, we shall use log(price_4_nights) because it is closer to normally distributed, whereas price_4_nights is heavily right-skewed.

listings_filtered %>%
  ggplot(aes(x=price_4_nights)) +
  geom_histogram()+
  labs(title="Price For 4 Nights vs. Count",x="price for 4 nights")

#filter for price less than 5000
listings_filtered %>%
  filter(price_4_nights <= 5000) %>%
  ggplot(aes(x=price_4_nights)) +
  geom_histogram()+
  labs(title="Price For 4 Nights (<5000) vs. Count",x="price for 4 nights")

Without adjusting for range of prices, we see that the distribution of price is highly skewed towards the right. When we limit the price to under 5000, we see that the distribution is still skewed to the right.

As a next step, we log the price_4_nights variable:

#log price of 4 nights 
listings_filtered_log <- listings_filtered %>%
  mutate(log_price4 = log(price_4_nights))

listings_filtered_log %>%
  ggplot(aes(x=log_price4)) +
  geom_histogram() +
  labs(title="Log Price For 4 Nights vs. Count", x="log price for 4 nights")

listings_filtered_log %>%
  ggplot(aes(x=log_price4)) +
  geom_density() +
  labs(title="Log Price For 4 Nights Density Graph", x="log price for 4 nights")

After the log transformation, the distribution looks closer to normal, although it is still slightly skewed towards the right given the outliers.
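
For reference, the log-linear specification used in the models below can be written as follows. The symbols \(x_{ij}\), \(\beta_j\), and \(k\) are ours, introduced only for illustration; the \(x_{ij}\) stand for whichever explanatory variables a given model includes.

```latex
\[
\log(\text{price\_4\_nights}_i) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i
\]
% A one-unit increase in x_j, holding the other variables fixed,
% multiplies the predicted price by e^{\beta_j}:
\[
\%\Delta\,\text{price\_4\_nights} = 100\,\bigl(e^{\beta_j} - 1\bigr) \approx 100\,\beta_j
\quad \text{for small } \beta_j .
\]
```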

In the following regression models, we use both the summary (msummary) and anova functions to determine the significance of variables and to assess categorical variables with more than 2 levels as a whole. We also check for collinearity using the vif function.

4.1 Model 1

First, we shall fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

model1 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating, data=listings_filtered_log)
# test of individual levels against the reference level
msummary(model1)
                                                       Estimate Std. Error
(Intercept)                                           6.8122074  0.1571685
prop_type_simplifiedEntire rental unit               -0.1767883  0.0530409
prop_type_simplifiedEntire residential home           0.0949705  0.0517947
prop_type_simplifiedOther                            -0.5800841  0.0454309
prop_type_simplifiedPrivate room in residential home -1.0227953  0.0521221
number_of_reviews                                    -0.0012898  0.0001067
review_scores_rating                                  0.0779146  0.0309297
                                                     t value Pr(>|t|)    
(Intercept)                                           43.343  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.333 0.000872 ***
prop_type_simplifiedEntire residential home            1.834 0.066835 .  
prop_type_simplifiedOther                            -12.768  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -19.623  < 2e-16 ***
number_of_reviews                                    -12.085  < 2e-16 ***
review_scores_rating                                   2.519 0.011829 *  

Residual standard error: 0.5721 on 2467 degrees of freedom
  (247 observations deleted due to missingness)
Multiple R-squared:  0.3298,    Adjusted R-squared:  0.3281 
F-statistic: 202.3 on 6 and 2467 DF,  p-value: < 2.2e-16
# test of the factor as a whole
anova(model1)
                       Df Sum Sq Mean Sq F value    Pr(>F)
prop_type_simplified    4 349     87.2    266    2.05e-190
number_of_reviews       1  46.6   46.6    142    6.17e-32
review_scores_rating    1   2.08   2.08     6.35 0.0118
Residuals            2467 808      0.327
car::vif(model1)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.081898  4        1.009888
number_of_reviews    1.040213  1        1.019909
review_scores_rating 1.049484  1        1.024443
#check the residuals
autoplot(model1)

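This regression shows that, holding property type and the number of reviews constant, every 1-point increase in review_scores_rating is associated with an increase of about 0.078 in log_price4, that is, roughly an 8% higher total cost for two people to stay at an Airbnb property for 4 nights.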

For property types, R automatically drops one level, Entire condominium (condo), and makes it the reference group. So the coefficient of Entire rental unit (-0.177) means that, other factors being equal, the 4-night price for an Entire rental unit is about 16% lower on average than for the reference group Entire condominium (condo), since exp(-0.177) - 1 ≈ -0.16. The coefficients of Entire residential home, Other, and Private room in residential home can be interpreted similarly. Condos seem to be more expensive than all other property types (excluding the statistically insignificant Entire residential home category).
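
Because the dependent variable is in logs, each property-type coefficient can be converted into a percentage difference relative to the condo reference group. A minimal sketch using the estimates reported in the msummary(model1) output:

```r
# Estimates copied from the msummary(model1) output above
coefs <- c(
  "Entire rental unit" = -0.1767883,
  "Entire residential home" = 0.0949705,
  "Other" = -0.5800841,
  "Private room in residential home" = -1.0227953
)

# Percentage difference in the 4-night price relative to the reference level (condo)
pct_diff <- 100 * (exp(coefs) - 1)
print(round(pct_diff, 1))
```

For large coefficients such as the one on Private room in residential home, reading the raw coefficient as a percentage would overstate the difference noticeably, which is why the exponentiated form is used.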

From this regression, we can see that only review_scores_rating and prop_type_simplified (Entire residential home) have positive coefficients. Of these, only review_scores_rating has a t value (2.519) exceeding 1.96, providing sufficient evidence at the 5% significance level to conclude that its coefficient is different from zero; the coefficient on Entire residential home (t = 1.834, p = 0.067) is significant only at the 10% level.

Further, the GVIF values of all explanatory variables are close to 1, well below the common threshold of 5. As a result, the likelihood of multicollinearity is low.
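
To show what the variance inflation factor measures, the sketch below computes it by hand in base R. It uses the built-in mtcars data as a stand-in (the predictors wt, hp, and disp are our choice for illustration, not variables from the listings data). For numeric predictors with one degree of freedom, this is the quantity car::vif reports as GVIF.

```r
# Built-in mtcars data used as a stand-in for the listings data
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
manual_vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  fit <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})

print(round(manual_vif, 2))
```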

In addition, the Q-Q plot shows the points departing from the 45-degree line after the second quantile, indicating that the residuals have a heavier right tail than a normal distribution (excess kurtosis and right skew).
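
The visual reading of the Q-Q plot can be backed up with simple moment-based measures of the residuals. The sketch below uses a stand-in model on the built-in mtcars data (in the actual analysis, model1 would take the place of fit) and base R only.

```r
# Stand-in model on built-in data; model1 would be used in the actual analysis
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
res <- resid(fit)

# Draw the Q-Q plot to a null device so the sketch runs without opening a window;
# drop the pdf(NULL) / dev.off() lines to see the plot interactively
pdf(NULL)
qqnorm(res)
qqline(res)
invisible(dev.off())

# Moment-based sample skewness and excess kurtosis of the residuals
z <- (res - mean(res)) / sqrt(mean((res - mean(res))^2))
skewness <- mean(z^3)
excess_kurtosis <- mean(z^4) - 3
print(c(skewness = skewness, excess_kurtosis = excess_kurtosis))
```

Positive skewness points to a longer right tail, and positive excess kurtosis to heavier tails than the normal distribution.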

4.2 Model 2

We would now like to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. As such, we shall fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

model2 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data=listings_filtered_log)
msummary(model2)
                                                       Estimate Std. Error
(Intercept)                                           6.7550695  0.1573896
prop_type_simplifiedEntire rental unit               -0.1821767  0.0513129
prop_type_simplifiedEntire residential home           0.0918358  0.0501024
prop_type_simplifiedOther                            -0.4920510  0.0479920
prop_type_simplifiedPrivate room in residential home -0.8836573  0.0608969
number_of_reviews                                    -0.0011940  0.0001047
review_scores_rating                                  0.0889932  0.0310289
room_typeHotel room                                   0.2935523  0.0756587
room_typePrivate room                                -0.1462458  0.0335152
room_typeShared room                                 -1.1900249  0.1028209
                                                     t value Pr(>|t|)    
(Intercept)                                           42.919  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.550 0.000392 ***
prop_type_simplifiedEntire residential home            1.833 0.066929 .  
prop_type_simplifiedOther                            -10.253  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -14.511  < 2e-16 ***
number_of_reviews                                    -11.402  < 2e-16 ***
review_scores_rating                                   2.868 0.004165 ** 
room_typeHotel room                                    3.880 0.000107 ***
room_typePrivate room                                 -4.364 1.33e-05 ***
room_typeShared room                                 -11.574  < 2e-16 ***

Residual standard error: 0.5534 on 2464 degrees of freedom
  (247 observations deleted due to missingness)
Multiple R-squared:  0.3737,    Adjusted R-squared:  0.3714 
F-statistic: 163.3 on 9 and 2464 DF,  p-value: < 2.2e-16
car::vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.290384  4        1.109146
number_of_reviews    1.070214  1        1.034512
review_scores_rating 1.128889  1        1.062492
room_type            2.364948  3        1.154260
anova(model2)
                       Df Sum Sq Mean Sq F value    Pr(>F)
prop_type_simplified    4 349     87.2    285    2.36e-201
number_of_reviews       1  46.6   46.6    152    5.91e-34
review_scores_rating    1   2.08   2.08     6.78 0.00926
room_type               3  52.9   17.6     57.6  5.73e-36
Residuals            2464 755      0.306
autoplot(model2)

From this regression, we can see that the additional variable room_type has a positive effect on log_price4 when the room is a hotel room, and a negative one when the room is either a private or a shared room. The reference level chosen by R is Entire home/apt, which explains why private rooms and shared rooms have negative coefficients, as they are usually cheaper than renting an entire home. For each of these levels, the absolute t value exceeds 1.96, providing sufficient evidence at the 5% significance level to conclude that the coefficients differ significantly from zero.
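
Whether room_type matters as a whole, rather than level by level, is a nested-model F test. The sketch below recomputes it from the rounded residual sums of squares in the two anova tables, so the result matches the reported F value for room_type only up to rounding. With the fitted objects at hand, anova(model1, model2) makes the same comparison.

```r
# Residual sums of squares and degrees of freedom copied from the anova tables above
rss_model1 <- 808; df_model1 <- 2467
rss_model2 <- 755; df_model2 <- 2464

# Number of extra coefficients in model2 (the 3 room_type dummies)
extra_terms <- df_model1 - df_model2

# Partial F statistic and its p-value
f_stat <- ((rss_model1 - rss_model2) / extra_terms) / (rss_model2 / df_model2)
p_value <- pf(f_stat, extra_terms, df_model2, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value)
```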

In addition, although higher than in the previous regression, the GVIF values remain below 5. Therefore, the likelihood of multicollinearity remains low.

Also, as with the previous regression, we can infer from the Q-Q plot that the residuals exhibit excess kurtosis and are skewed to the right.
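
Adding room_type raises the adjusted R-squared from 0.3281 to 0.3714. As a check on how that statistic penalises extra predictors, the sketch below recomputes it from the reported Multiple R-squared values. The helper adj_r2 is ours, and small differences in the last digit come from rounding in the printed output.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual degrees of freedom + number of coefficients (6 slopes + intercept in model1)
n_obs <- 2467 + 7

adj_model1 <- adj_r2(0.3298, n_obs, p = 6)
adj_model2 <- adj_r2(0.3737, n_obs, p = 9)
print(round(c(model1 = adj_model1, model2 = adj_model2), 3))
```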

4.3 Model 3

Are the number of bathrooms, bedrooms, beds, or the size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables? Next, we analyse the potential effects of bathrooms (via bathrooms_text, since the bathrooms column is entirely missing), bedrooms, beds, and accommodates on log_price4 and examine the collinearity between these variables.

model3 <- lm(log_price4 ~ bedrooms + beds + bathrooms_text + accommodates, data = listings_filtered_log)
msummary(model3)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.75540    0.15967  36.045  < 2e-16 ***
bedrooms                         0.22276    0.02983   7.468 1.15e-13 ***
beds                            -0.08778    0.01690  -5.194 2.24e-07 ***
bathrooms_text0 shared baths    -0.36630    0.33534  -1.092 0.274798    
bathrooms_text1 bath             0.50773    0.15954   3.183 0.001480 ** 
bathrooms_text1 private bath     0.35554    0.16030   2.218 0.026654 *  
bathrooms_text1 shared bath     -0.05880    0.16289  -0.361 0.718129    
bathrooms_text1.5 baths          0.68987    0.16933   4.074 4.78e-05 ***
bathrooms_text1.5 shared baths  -0.02169    0.18094  -0.120 0.904582    
bathrooms_text10 baths          -0.09856    0.44698  -0.221 0.825493    
bathrooms_text10 shared baths   -0.84395    0.21623  -3.903 9.77e-05 ***
bathrooms_text2 baths            0.75631    0.16530   4.576 5.00e-06 ***
bathrooms_text2 shared baths     0.02393    0.19368   0.124 0.901680    
bathrooms_text2.5 baths          0.86175    0.18585   4.637 3.74e-06 ***
bathrooms_text2.5 shared baths  -0.16290    0.44693  -0.364 0.715524    
bathrooms_text3 baths            0.96823    0.18803   5.149 2.84e-07 ***
bathrooms_text3 shared baths    -0.12170    0.61220  -0.199 0.842450    
bathrooms_text3.5 baths          1.36962    0.22036   6.215 6.07e-10 ***
bathrooms_text4 baths            0.86187    0.25065   3.439 0.000595 ***
bathrooms_text4 shared baths    -0.24776    0.26205  -0.945 0.344515    
bathrooms_text4.5 baths          1.38655    0.30222   4.588 4.72e-06 ***
bathrooms_text5 baths            1.91287    0.46182   4.142 3.57e-05 ***
bathrooms_text5 shared baths     0.20971    0.24484   0.857 0.391804    
bathrooms_text6 shared baths     0.76533    0.44887   1.705 0.088327 .  
bathrooms_text6.5 shared baths   0.66997    0.61302   1.093 0.274557    
bathrooms_textPrivate half-bath -0.03078    0.37618  -0.082 0.934803    
bathrooms_textShared half-bath  -0.34844    0.44693  -0.780 0.435699    
accommodates                     0.09137    0.01378   6.632 4.11e-11 ***

Residual standard error: 0.5912 on 2288 degrees of freedom
  (405 observations deleted due to missingness)
Multiple R-squared:  0.424, Adjusted R-squared:  0.4172 
F-statistic: 62.37 on 27 and 2288 DF,  p-value: < 2.2e-16
car::vif(model3) #vif < 5 then no multicollinearity 
                   GVIF Df GVIF^(1/(2*Df))
bedrooms       4.339735  1        2.083203
beds           3.461104  1        1.860404
bathrooms_text 2.751117 24        1.021307
accommodates   4.847430  1        2.201688
anova(model3)
                 Df Sum Sq Mean Sq  F value    Pr(>F)
bedrooms          1    414     414 1.18e+03 1.56e-209
beds              1   4.44    4.44     12.7  0.000373
bathrooms_text   24    155    6.45     18.5  2.99e-71
accommodates      1   15.4    15.4       44  4.11e-11
Residuals      2288    800    0.35
autoplot(model3)

From this regression, we can see that ‘bedrooms’ and ‘accommodates’ have positive coefficients on the dependent variable, whilst ‘beds’ has a negative one. For each of these variables, the absolute t value exceeds 1.96, providing sufficient evidence at the 5% significance level to conclude that the coefficients differ significantly from zero. Certain categories of the bathrooms_text variable also have significant coefficients.

In addition, the VIFs continue to rise but remain below 5, indicating a low likelihood of collinearity.
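As a rough illustration of the 1.96 rule, the sketch below recomputes the two-sided 5% critical value and the p-value for ‘beds’ from the t statistic and residual degrees of freedom printed in the model3 summary. The variable names are ours and only base R functions are used; this is a cross-check under those assumptions, not part of the original analysis.

```r
# Values copied from the msummary(model3) output above:
# t statistic for beds and the residual degrees of freedom.
t_beds   <- -5.194
resid_df <- 2288

# Two-sided 5% critical value of a t distribution with resid_df degrees of freedom
t_crit <- qt(0.975, df = resid_df)

# Two-sided p-value implied by the t statistic
p_beds <- 2 * pt(-abs(t_beds), df = resid_df)

abs(t_beds) > t_crit   # TRUE means the coefficient is significant at the 5% level
signif(p_beds, 3)
```

With this many degrees of freedom the critical value is very close to the normal value of 1.96, which is why the 1.96 cutoff is a reasonable shortcut here.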

Further, the sample distribution exhibits the greatest level of kurtosis compared to the previous models.

4.4 Model 4

  1. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

Since the four variables run in model3 are significant, we will add them to the variables used in model2 to create model4.

#add all variables 
model4 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + accommodates + bedrooms + beds + bathrooms_text , data=listings_filtered_log)
msummary(model4)
                                                       Estimate Std. Error
(Intercept)                                           5.870e+00  2.153e-01
prop_type_simplifiedEntire rental unit               -1.826e-01  4.576e-02
prop_type_simplifiedEntire residential home          -2.625e-01  4.563e-02
prop_type_simplifiedOther                            -3.218e-01  4.450e-02
prop_type_simplifiedPrivate room in residential home -6.444e-01  5.570e-02
number_of_reviews                                    -8.653e-04  9.307e-05
review_scores_rating                                  6.805e-02  2.806e-02
room_typeHotel room                                   4.890e-01  7.370e-02
room_typePrivate room                                -8.377e-02  4.922e-02
room_typeShared room                                 -6.073e-01  1.504e-01
accommodates                                          6.591e-02  1.132e-02
bedrooms                                              2.137e-01  2.541e-02
beds                                                 -5.273e-02  1.529e-02
bathrooms_text0 shared baths                         -2.902e-01  2.761e-01
bathrooms_text1 bath                                  3.854e-01  1.579e-01
bathrooms_text1 private bath                          4.186e-01  1.541e-01
bathrooms_text1 shared bath                           1.356e-01  1.561e-01
bathrooms_text1.5 baths                               5.911e-01  1.628e-01
bathrooms_text1.5 shared baths                        1.918e-01  1.686e-01
bathrooms_text10 baths                                2.397e-02  3.559e-01
bathrooms_text10 shared baths                        -2.386e-01  2.245e-01
bathrooms_text2 baths                                 6.293e-01  1.601e-01
bathrooms_text2 shared baths                          2.795e-01  1.765e-01
bathrooms_text2.5 baths                               7.598e-01  1.733e-01
bathrooms_text2.5 shared baths                       -1.994e-01  4.798e-01
bathrooms_text3 baths                                 7.573e-01  1.757e-01
bathrooms_text3 shared baths                         -9.154e-02  4.804e-01
bathrooms_text3.5 baths                               1.290e+00  1.974e-01
bathrooms_text4 baths                                 7.530e-01  2.161e-01
bathrooms_text4 shared baths                         -3.467e-01  2.218e-01
bathrooms_text4.5 baths                               9.072e-01  2.665e-01
bathrooms_text5 baths                                 1.874e+00  3.696e-01
bathrooms_text5 shared baths                          3.995e-01  2.166e-01
bathrooms_text6 shared baths                          7.412e-01  3.578e-01
bathrooms_textPrivate half-bath                       3.745e-01  3.054e-01
bathrooms_textShared half-bath                       -1.111e-01  3.563e-01
                                                     t value Pr(>|t|)    
(Intercept)                                           27.270  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.991 6.80e-05 ***
prop_type_simplifiedEntire residential home           -5.753 1.01e-08 ***
prop_type_simplifiedOther                             -7.230 6.78e-13 ***
prop_type_simplifiedPrivate room in residential home -11.571  < 2e-16 ***
number_of_reviews                                     -9.298  < 2e-16 ***
review_scores_rating                                   2.425 0.015389 *  
room_typeHotel room                                    6.635 4.12e-11 ***
room_typePrivate room                                 -1.702 0.088888 .  
room_typeShared room                                  -4.039 5.56e-05 ***
accommodates                                           5.823 6.67e-09 ***
bedrooms                                               8.413  < 2e-16 ***
beds                                                  -3.448 0.000576 ***
bathrooms_text0 shared baths                          -1.051 0.293427    
bathrooms_text1 bath                                   2.441 0.014717 *  
bathrooms_text1 private bath                           2.717 0.006647 ** 
bathrooms_text1 shared bath                            0.869 0.385016    
bathrooms_text1.5 baths                                3.632 0.000289 ***
bathrooms_text1.5 shared baths                         1.138 0.255366    
bathrooms_text10 baths                                 0.067 0.946312    
bathrooms_text10 shared baths                         -1.063 0.287837    
bathrooms_text2 baths                                  3.932 8.71e-05 ***
bathrooms_text2 shared baths                           1.583 0.113491    
bathrooms_text2.5 baths                                4.384 1.22e-05 ***
bathrooms_text2.5 shared baths                        -0.416 0.677726    
bathrooms_text3 baths                                  4.310 1.71e-05 ***
bathrooms_text3 shared baths                          -0.191 0.848886    
bathrooms_text3.5 baths                                6.531 8.19e-11 ***
bathrooms_text4 baths                                  3.485 0.000503 ***
bathrooms_text4 shared baths                          -1.563 0.118175    
bathrooms_text4.5 baths                                3.404 0.000677 ***
bathrooms_text5 baths                                  5.072 4.30e-07 ***
bathrooms_text5 shared baths                           1.844 0.065280 .  
bathrooms_text6 shared baths                           2.072 0.038396 *  
bathrooms_textPrivate half-bath                        1.226 0.220239    
bathrooms_textShared half-bath                        -0.312 0.755228    

Residual standard error: 0.4551 on 2059 degrees of freedom
  (626 observations deleted due to missingness)
Multiple R-squared:  0.6108,    Adjusted R-squared:  0.6042 
F-statistic: 92.31 on 35 and 2059 DF,  p-value: < 2.2e-16
car::vif(model4)
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified  3.529757  4        1.170761
number_of_reviews     1.113927  1        1.055428
review_scores_rating  1.128418  1        1.062270
room_type            19.054339  3        1.634302
accommodates          5.117136  1        2.262109
bedrooms              4.930019  1        2.220365
beds                  4.488917  1        2.118707
bathrooms_text       28.140319 23        1.075244
anova(model4)
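                       Df Sum Sq Mean Sq F value    Pr(>F)
prop_type_simplified    4    352    87.9     424 6.88e-267
number_of_reviews       1   48.5    48.5     234   3.7e-50
review_scores_rating    1   1.71    1.71    8.25   0.00413
room_type               3   68.2    22.7     110  6.42e-66
accommodates            1    122     122     588 1.91e-114
bedrooms                1     31      31     150  2.85e-33
beds                    1   2.88    2.88    13.9  0.000196
bathrooms_text         23   43.6    1.89    9.15  1.96e-30
Residuals            2059    427   0.207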
autoplot(model4)

In model4, the categorical variables bathrooms_text and room_type have very high GVIFs (28.1 and 19.1 respectively), indicating collinearity issues. Therefore, we try dropping the bathrooms_text variable to see whether that eliminates the collinearity issue.
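For terms with more than one degree of freedom, car::vif reports a generalized VIF (GVIF), and its last column, GVIF^(1/(2*Df)), adjusts for the number of coefficients in the term. One possible cross-check is sketched below: it recomputes that column from the values printed above and squares it, on the assumption that the squared value is roughly comparable to an ordinary VIF cutoff such as 5. The vector names are ours.

```r
# GVIF and Df values copied from the car::vif(model4) output above
gvif <- c(room_type = 19.054339, bathrooms_text = 28.140319, accommodates = 5.117136)
df   <- c(room_type = 3,         bathrooms_text = 23,        accommodates = 1)

# Adjusted GVIF: reproduces the GVIF^(1/(2*Df)) column
adj_gvif <- gvif^(1 / (2 * df))

# Squaring puts the adjusted value on the scale of an ordinary VIF
# (assumption: this is the scale the "VIF < 5" rule of thumb refers to)
round(cbind(gvif, df, adj_gvif, adj_gvif_sq = adj_gvif^2), 3)
```

This is a supplementary reading of the same output; the raw GVIF of a 23-level factor is expected to be large simply because the term has many coefficients.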

#bathroom text dropped (VIF high)
model4_1 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + accommodates + bedrooms + beds  , data=listings_filtered_log)
msummary(model4_1)
                                                       Estimate Std. Error
(Intercept)                                           6.066e+00  1.512e-01
prop_type_simplifiedEntire rental unit               -1.909e-01  4.769e-02
prop_type_simplifiedEntire residential home          -2.364e-01  4.730e-02
prop_type_simplifiedOther                            -3.140e-01  4.606e-02
prop_type_simplifiedPrivate room in residential home -6.487e-01  5.725e-02
number_of_reviews                                    -9.263e-04  9.626e-05
review_scores_rating                                  8.192e-02  2.908e-02
room_typeHotel room                                   4.681e-01  7.220e-02
room_typePrivate room                                -1.449e-01  3.337e-02
room_typeShared room                                 -1.046e+00  1.006e-01
accommodates                                          8.496e-02  1.138e-02
bedrooms                                              2.933e-01  2.394e-02
beds                                                 -5.421e-02  1.517e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           40.128  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -4.003 6.47e-05 ***
prop_type_simplifiedEntire residential home           -4.998 6.27e-07 ***
prop_type_simplifiedOther                             -6.818 1.21e-11 ***
prop_type_simplifiedPrivate room in residential home -11.331  < 2e-16 ***
number_of_reviews                                     -9.623  < 2e-16 ***
review_scores_rating                                   2.817  0.00489 ** 
room_typeHotel room                                    6.483 1.12e-10 ***
room_typePrivate room                                 -4.343 1.47e-05 ***
room_typeShared room                                 -10.400  < 2e-16 ***
accommodates                                           7.465 1.22e-13 ***
bedrooms                                              12.250  < 2e-16 ***
beds                                                  -3.574  0.00036 ***

Residual standard error: 0.4752 on 2082 degrees of freedom
  (626 observations deleted due to missingness)
Multiple R-squared:  0.571, Adjusted R-squared:  0.5685 
F-statistic: 230.9 on 12 and 2082 DF,  p-value: < 2.2e-16
car::vif(model4_1)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 3.119093  4        1.152799
number_of_reviews    1.093281  1        1.045601
review_scores_rating 1.111871  1        1.054453
room_type            3.470929  3        1.230479
accommodates         4.746393  1        2.178622
bedrooms             4.016734  1        2.004179
beds                 4.051395  1        2.012808
anova(model4_1)
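                       Df Sum Sq Mean Sq F value    Pr(>F)
prop_type_simplified    4    352    87.9     389 1.44e-250
number_of_reviews       1   48.5    48.5     215  2.22e-46
review_scores_rating    1   1.71    1.71    7.57     0.006
room_type               3   68.2    22.7     101  7.32e-61
accommodates            1    122     122     539 2.79e-106
bedrooms                1     31      31     137  9.34e-31
beds                    1   2.88    2.88    12.8   0.00036
Residuals            2082    470   0.226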
autoplot(model4_1)

After dropping the bathrooms_text variable, all GVIFs fall below 5 and the collinearity issue is resolved. We then proceed with this model and add the host_is_superhost variable to determine whether it has any explanatory power beyond what model4_1 already provides.

4.5 Model 5

#adding superhost variable (insignificant)
model5 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + accommodates + bedrooms + beds + factor(host_is_superhost), data=listings_filtered_log)
msummary(model5)
                                                       Estimate Std. Error
(Intercept)                                           6.096e+00  1.523e-01
prop_type_simplifiedEntire rental unit               -1.935e-01  4.787e-02
prop_type_simplifiedEntire residential home          -2.391e-01  4.749e-02
prop_type_simplifiedOther                            -3.182e-01  4.625e-02
prop_type_simplifiedPrivate room in residential home -6.614e-01  5.781e-02
number_of_reviews                                    -9.561e-04  9.833e-05
review_scores_rating                                  7.174e-02  2.991e-02
room_typeHotel room                                   4.818e-01  7.278e-02
room_typePrivate room                                -1.357e-01  3.395e-02
room_typeShared room                                 -1.046e+00  1.006e-01
accommodates                                          8.550e-02  1.139e-02
bedrooms                                              2.950e-01  2.395e-02
beds                                                 -5.533e-02  1.518e-02
factor(host_is_superhost)TRUE                         3.454e-02  2.350e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           40.022  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -4.042 5.49e-05 ***
prop_type_simplifiedEntire residential home           -5.034 5.22e-07 ***
prop_type_simplifiedOther                             -6.880 7.88e-12 ***
prop_type_simplifiedPrivate room in residential home -11.441  < 2e-16 ***
number_of_reviews                                     -9.724  < 2e-16 ***
review_scores_rating                                   2.399 0.016540 *  
room_typeHotel room                                    6.620 4.56e-11 ***
room_typePrivate room                                 -3.997 6.63e-05 ***
room_typeShared room                                 -10.403  < 2e-16 ***
accommodates                                           7.504 9.10e-14 ***
bedrooms                                              12.315  < 2e-16 ***
beds                                                  -3.645 0.000274 ***
factor(host_is_superhost)TRUE                          1.470 0.141693    

Residual standard error: 0.475 on 2079 degrees of freedom
  (628 observations deleted due to missingness)
Multiple R-squared:  0.5715,    Adjusted R-squared:  0.5688 
F-statistic: 213.3 on 13 and 2079 DF,  p-value: < 2.2e-16
car::vif(model5)
                              GVIF Df GVIF^(1/(2*Df))
prop_type_simplified      3.210358  4        1.156962
number_of_reviews         1.140702  1        1.068037
review_scores_rating      1.176100  1        1.084481
room_type                 3.619412  3        1.239100
accommodates              4.755881  1        2.180798
bedrooms                  4.021518  1        2.005372
beds                      4.059050  1        2.014708
factor(host_is_superhost) 1.228523  1        1.108387
anova(model5)
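                            Df Sum Sq Mean Sq F value    Pr(>F)
prop_type_simplified         4    351    87.8     389 2.71e-250
number_of_reviews            1   48.5    48.5     215   2.1e-46
review_scores_rating         1   1.71    1.71    7.59   0.00593
room_type                    3   68.2    22.7     101  6.98e-61
accommodates                 1    122     122     539  3.6e-106
bedrooms                     1   31.1    31.1     138  7.22e-31
beds                         1   2.89    2.89    12.8   0.00035
factor(host_is_superhost)    1  0.488   0.488    2.16     0.142
Residuals                 2079    469   0.226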
autoplot(model5)

According to our regression analysis, after controlling for other variables, superhosts (host_is_superhost) command a small premium: the coefficient is positive at 0.035 on the log scale, or roughly 3.5%. However, this point estimate is not significant at the 5% significance level (p = 0.142). We will not be including this variable in further models.
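As a consistency check between the two tables above: when a single-coefficient term is entered last in the model formula, the F statistic in the final row of the sequential ANOVA equals the square of that coefficient's t statistic. The sketch below verifies this for host_is_superhost using the printed values; the variable names are ours.

```r
# Values copied from the msummary(model5) output above
t_superhost <- 1.470
resid_df    <- 2079

# For a 1-df term entered last, the sequential F statistic is t squared
f_superhost <- t_superhost^2

# Upper-tail p-value of the F statistic on (1, resid_df) degrees of freedom
p_superhost <- pf(f_superhost, df1 = 1, df2 = resid_df, lower.tail = FALSE)

round(f_superhost, 2)   # compare with the F value in the last ANOVA row
round(p_superhost, 3)   # compare with Pr(>F) in the same row
```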

4.6 Model 6

  1. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?

Using the same variables defined in model4_1 and adding the instant_bookable factor:

#significant 
model6 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + accommodates + bedrooms + beds + factor(instant_bookable), data=listings_filtered_log)
msummary(model6)
                                                       Estimate Std. Error
(Intercept)                                           6.1332122  0.1508618
prop_type_simplifiedEntire rental unit               -0.1816031  0.0474458
prop_type_simplifiedEntire residential home          -0.2251862  0.0470785
prop_type_simplifiedOther                            -0.2956082  0.0459296
prop_type_simplifiedPrivate room in residential home -0.6535120  0.0569190
number_of_reviews                                    -0.0009338  0.0000957
review_scores_rating                                  0.0740351  0.0289516
room_typeHotel room                                   0.5316431  0.0728597
room_typePrivate room                                -0.1184835  0.0335761
room_typeShared room                                 -1.0521225  0.1000352
accommodates                                          0.0850067  0.0113137
bedrooms                                              0.2848546  0.0238591
beds                                                 -0.0503055  0.0150978
factor(instant_bookable)TRUE                         -0.1139822  0.0224776
                                                     t value Pr(>|t|)    
(Intercept)                                           40.655  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.828 0.000133 ***
prop_type_simplifiedEntire residential home           -4.783 1.85e-06 ***
prop_type_simplifiedOther                             -6.436 1.52e-10 ***
prop_type_simplifiedPrivate room in residential home -11.481  < 2e-16 ***
number_of_reviews                                     -9.757  < 2e-16 ***
review_scores_rating                                   2.557 0.010622 *  
room_typeHotel room                                    7.297 4.17e-13 ***
room_typePrivate room                                 -3.529 0.000427 ***
room_typeShared room                                 -10.518  < 2e-16 ***
accommodates                                           7.514 8.49e-14 ***
bedrooms                                              11.939  < 2e-16 ***
beds                                                  -3.332 0.000877 ***
factor(instant_bookable)TRUE                          -5.071 4.31e-07 ***

Residual standard error: 0.4724 on 2081 degrees of freedom
  (626 observations deleted due to missingness)
Multiple R-squared:  0.5762,    Adjusted R-squared:  0.5736 
F-statistic: 217.7 on 13 and 2081 DF,  p-value: < 2.2e-16
car::vif(model6)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     3.198388  4        1.156422
number_of_reviews        1.093539  1        1.045724
review_scores_rating     1.115087  1        1.055977
room_type                3.626327  3        1.239494
accommodates             4.746396  1        2.178623
bedrooms                 4.036335  1        2.009063
beds                     4.061936  1        2.015425
factor(instant_bookable) 1.141187  1        1.068264
anova(model6)
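                           Df Sum Sq Mean Sq F value    Pr(>F)
prop_type_simplified        4    352    87.9     394 7.96e-253
number_of_reviews           1   48.5    48.5     217  6.98e-47
review_scores_rating        1   1.71    1.71    7.65   0.00571
room_type                   3   68.2    22.7     102  1.55e-61
accommodates                1    122     122     546 2.22e-107
bedrooms                    1     31      31     139  4.33e-31
beds                        1   2.88    2.88    12.9  0.000332
factor(instant_bookable)    1   5.74    5.74    25.7  4.31e-07
Residuals                2081    464   0.223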
autoplot(model6)

From model6 we see that the instant_bookable variable is significant, and that being instantly bookable actually lowers the price of an Airbnb for 4 nights (coefficient -0.114 on the log scale), holding all else constant. This is possibly because instantly bookable listings rely on lower prices to be booked quickly. The adjusted R-squared here is 0.574, slightly above the 0.569 of model4_1, with no multicollinearity issues based on the VIF. As illustrated by the Q-Q plot, the addition of these variables brings the residual distribution closer to normal.
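Because the dependent variable is the log of price, a coefficient b on a dummy variable corresponds to a price change of 100 * (exp(b) - 1) percent, holding the other variables fixed. The sketch below applies this conversion to three coefficients copied from the model6 output; the labels in the vector are ours.

```r
# Coefficients copied from the msummary(model6) output above
coefs <- c(
  instant_bookable = -0.1139822,
  hotel_room       =  0.5316431,
  shared_room      = -1.0521225
)

# Percentage change in price implied by each log-scale coefficient
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

For small coefficients the percentage is close to 100 * b, but for large ones such as the shared room coefficient the two differ noticeably, so the exponential form is the safer one to report.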

4.7 Model 7

  1. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, so we would not include all of them in the model. Therefore, we define a new variable neighbourhood_simplified, which identifies the top neighborhoods in San Francisco, and groups the other neighborhoods as “other”, in order to determine whether location is a predictor of price_4_nights.
# determine neighbourhoods where the majority of listings falls in
listings_filtered_log %>% 
  group_by(neighbourhood_cleansed) %>% 
  count() %>% 
  arrange(desc(n))
# A tibble: 36 × 2
# Groups:   neighbourhood_cleansed [36]
   neighbourhood_cleansed     n
   <chr>                  <int>
 1 Downtown/Civic Center    493
 2 Mission                  171
 3 Outer Sunset             150
 4 Bernal Heights           146
 5 Western Addition         133
 6 Castro/Upper Market      121
 7 Financial District       100
 8 Haight Ashbury            99
 9 Chinatown                 98
10 Noe Valley                98
# … with 26 more rows

We divide up San Francisco into 5 districts: Downtown, Outside Lands, Western Addition, Southern, and North of Downtown. This breakdown is roughly based on San Francisco’s own categorization of its different neighborhoods.

# create a new categorical variable 

listings_filtered_log <- listings_filtered_log %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed == "Downtown/Civic Center" ~ "Downtown", 
    neighbourhood_cleansed == "Financial District" ~ "Downtown", 
    neighbourhood_cleansed == "Haight Ashbury" ~ "Downtown", 
    neighbourhood_cleansed == "Chinatown" ~ "Downtown",
    neighbourhood_cleansed == "Nob Hill" ~ "Downtown", 
    neighbourhood_cleansed == "South of Market" ~ "Downtown", 
    neighbourhood_cleansed == "North Beach" ~ "Downtown", 
    neighbourhood_cleansed == "Golden Gate Park" ~ "Downtown", 
    neighbourhood_cleansed == "Russian Hill" ~ "North of Downtown", 
    neighbourhood_cleansed == "Marina" ~ "North of Downtown", 
    neighbourhood_cleansed == "Pacific Heights" ~ "North of Downtown", 
    neighbourhood_cleansed == "Ocean View" ~ "North of Downtown", 
    neighbourhood_cleansed == "West of Twin Peaks" ~ "North of Downtown",
    neighbourhood_cleansed == "Twin Peaks" ~ "North of Downtown", 
    neighbourhood_cleansed == "Seacliff" ~ "North of Downtown",
    neighbourhood_cleansed == "Presidio" ~ "North of Downtown",
    neighbourhood_cleansed == "Outer Sunset" ~ "Outside Lands",
    neighbourhood_cleansed == "Inner Richmond" ~ "Outside Lands",
    neighbourhood_cleansed == "Outer Richmond" ~ "Outside Lands",
    neighbourhood_cleansed == "Parkside" ~ "Outside Lands",
    neighbourhood_cleansed == "Inner Sunset" ~ "Outside Lands",
    neighbourhood_cleansed == "Lakeshore" ~ "Outside Lands",
    neighbourhood_cleansed == "Crocker Amazon" ~ "Outside Lands",
    neighbourhood_cleansed == "Presidio Heights" ~ "Outside Lands",
    neighbourhood_cleansed == "Western Addition" ~ "Western Addition",
    neighbourhood_cleansed == "Mission" ~ "Southern",
    neighbourhood_cleansed == "Bernal Heights" ~ "Southern",
    neighbourhood_cleansed == "Castro/Upper Market" ~ "Southern",
    neighbourhood_cleansed == "Noe Valley" ~ "Southern",
    neighbourhood_cleansed == "Bayview" ~ "Southern",
    neighbourhood_cleansed == "Potrero Hill" ~ "Southern",
    neighbourhood_cleansed == "Outer Mission" ~ "Southern",
    neighbourhood_cleansed == "Excelsior" ~ "Southern",
    neighbourhood_cleansed == "Visitacion Valley" ~ "Southern",
    neighbourhood_cleansed == "Glen Park" ~ "Southern",
    neighbourhood_cleansed == "Diamond Heights" ~ "Southern")) 


listings_filtered_log %>%
  count(neighbourhood_simplified) %>%
  arrange(desc(n))     
  neighbourhood_simplified     n
1 Downtown                  1027
2 Southern                   835
3 Outside Lands              448
4 North of Downtown          278
5 Western Addition           133
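A long case_when mapping like this can also be kept as a lookup table, which makes misspelled or unmapped neighbourhood names easy to spot because they come back as NA. The sketch below is a minimal base R illustration under stated assumptions: `district_lookup` and `sample_neighbourhoods` are hypothetical names, and only a handful of the 36 neighbourhoods are included.

```r
# Hypothetical lookup table: neighbourhood name -> district (subset only)
district_lookup <- c(
  "Downtown/Civic Center" = "Downtown",
  "Financial District"    = "Downtown",
  "Marina"                = "North of Downtown",
  "Outer Sunset"          = "Outside Lands",
  "Western Addition"      = "Western Addition",
  "Mission"               = "Southern"
)

# Example input, including one deliberately misspelled name
sample_neighbourhoods <- c("Mission", "Marina", "Outer Sunset", "Outher Richmond")

# Indexing a named vector by name returns NA for names that are not in it
district <- unname(district_lookup[sample_neighbourhoods])

data.frame(neighbourhood = sample_neighbourhoods, district = district)

# Names that failed to match, e.g. because of a typo
sample_neighbourhoods[is.na(district)]
```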

Now, we create a model that includes the variables prop_type_simplified, number_of_reviews, review_scores_rating, room_type, bedrooms, beds, bathrooms_text, accommodates, and the neighbourhood_simplified variable we just created.

model7 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + beds+ bathrooms_text + accommodates + neighbourhood_simplified, data=listings_filtered_log)
msummary(model7)
                                                       Estimate Std. Error
(Intercept)                                           5.816e+00  2.091e-01
prop_type_simplifiedEntire rental unit               -1.213e-01  4.475e-02
prop_type_simplifiedEntire residential home          -1.625e-01  4.539e-02
prop_type_simplifiedOther                            -2.489e-01  4.393e-02
prop_type_simplifiedPrivate room in residential home -4.595e-01  5.743e-02
number_of_reviews                                    -7.130e-04  9.166e-05
review_scores_rating                                  8.403e-02  2.743e-02
room_typeHotel room                                   3.263e-01  7.383e-02
room_typePrivate room                                -2.027e-01  4.948e-02
room_typeShared room                                 -8.192e-01  1.480e-01
bedrooms                                              2.382e-01  2.489e-02
beds                                                 -5.173e-02  1.484e-02
bathrooms_text0 shared baths                         -2.645e-01  2.676e-01
bathrooms_text1 bath                                  4.348e-01  1.532e-01
bathrooms_text1 private bath                          4.877e-01  1.498e-01
bathrooms_text1 shared bath                           2.319e-01  1.520e-01
bathrooms_text1.5 baths                               6.424e-01  1.580e-01
bathrooms_text1.5 shared baths                        3.033e-01  1.642e-01
bathrooms_text10 baths                                8.285e-03  3.450e-01
bathrooms_text10 shared baths                        -1.723e-01  2.178e-01
bathrooms_text2 baths                                 6.762e-01  1.553e-01
bathrooms_text2 shared baths                          4.139e-01  1.720e-01
bathrooms_text2.5 baths                               7.831e-01  1.682e-01
bathrooms_text2.5 shared baths                       -1.486e-01  4.674e-01
bathrooms_text3 baths                                 7.762e-01  1.704e-01
bathrooms_text3 shared baths                         -7.565e-02  4.656e-01
bathrooms_text3.5 baths                               1.217e+00  1.915e-01
bathrooms_text4 baths                                 6.600e-01  2.097e-01
bathrooms_text4 shared baths                         -3.317e-01  2.150e-01
bathrooms_text4.5 baths                               8.857e-01  2.584e-01
bathrooms_text5 baths                                 1.797e+00  3.585e-01
bathrooms_text5 shared baths                          4.383e-01  2.100e-01
bathrooms_text6 shared baths                          7.424e-01  3.468e-01
bathrooms_textPrivate half-bath                       5.641e-01  2.968e-01
bathrooms_textShared half-bath                       -1.414e-01  3.454e-01
accommodates                                          6.293e-02  1.100e-02
neighbourhood_simplifiedNorth of Downtown            -1.268e-02  3.729e-02
neighbourhood_simplifiedOutside Lands                -3.199e-01  3.448e-02
neighbourhood_simplifiedSouthern                     -2.417e-01  2.965e-02
neighbourhood_simplifiedWestern Addition             -5.433e-02  4.800e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           27.817  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.710 0.006774 ** 
prop_type_simplifiedEntire residential home           -3.580 0.000351 ***
prop_type_simplifiedOther                             -5.667 1.66e-08 ***
prop_type_simplifiedPrivate room in residential home  -8.000 2.05e-15 ***
number_of_reviews                                     -7.779 1.15e-14 ***
review_scores_rating                                   3.064 0.002215 ** 
room_typeHotel room                                    4.419 1.04e-05 ***
room_typePrivate room                                 -4.097 4.34e-05 ***
room_typeShared room                                  -5.535 3.50e-08 ***
bedrooms                                               9.571  < 2e-16 ***
beds                                                  -3.487 0.000499 ***
bathrooms_text0 shared baths                          -0.988 0.323092    
bathrooms_text1 bath                                   2.838 0.004584 ** 
bathrooms_text1 private bath                           3.254 0.001155 ** 
bathrooms_text1 shared bath                            1.526 0.127277    
bathrooms_text1.5 baths                                4.065 4.99e-05 ***
bathrooms_text1.5 shared baths                         1.848 0.064813 .  
bathrooms_text10 baths                                 0.024 0.980843    
bathrooms_text10 shared baths                         -0.791 0.428807    
bathrooms_text2 baths                                  4.355 1.40e-05 ***
bathrooms_text2 shared baths                           2.406 0.016199 *  
bathrooms_text2.5 baths                                4.656 3.44e-06 ***
bathrooms_text2.5 shared baths                        -0.318 0.750561    
bathrooms_text3 baths                                  4.556 5.51e-06 ***
bathrooms_text3 shared baths                          -0.162 0.870952    
bathrooms_text3.5 baths                                6.355 2.55e-10 ***
bathrooms_text4 baths                                  3.147 0.001671 ** 
bathrooms_text4 shared baths                          -1.543 0.123047    
bathrooms_text4.5 baths                                3.427 0.000621 ***
bathrooms_text5 baths                                  5.013 5.82e-07 ***
bathrooms_text5 shared baths                           2.087 0.037021 *  
bathrooms_text6 shared baths                           2.141 0.032400 *  
bathrooms_textPrivate half-bath                        1.900 0.057522 .  
bathrooms_textShared half-bath                        -0.409 0.682281    
accommodates                                           5.723 1.20e-08 ***
neighbourhood_simplifiedNorth of Downtown             -0.340 0.733764    
neighbourhood_simplifiedOutside Lands                 -9.278  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -8.153 6.12e-16 ***
neighbourhood_simplifiedWestern Addition              -1.132 0.257794    

Residual standard error: 0.4411 on 2055 degrees of freedom
  (626 observations deleted due to missingness)
Multiple R-squared:  0.6351,    Adjusted R-squared:  0.6281 
F-statistic: 91.69 on 39 and 2055 DF,  p-value: < 2.2e-16
car::vif(model7)
                              GVIF Df GVIF^(1/(2*Df))
prop_type_simplified      4.196777  4        1.196367
number_of_reviews         1.150243  1        1.072494
review_scores_rating      1.147557  1        1.071241
room_type                21.160546  3        1.663111
bedrooms                  5.035674  1        2.244031
beds                      4.497162  1        2.120651
bathrooms_text           31.225103 23        1.077678
accommodates              5.141670  1        2.267525
neighbourhood_simplified  1.767259  4        1.073773
anova(model7)
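                           Df   Sum Sq  Mean Sq  F value     Pr(>F)
prop_type_simplified        4      352     87.9      452  1.43e-279
number_of_reviews           1     48.5     48.5      249   4.24e-53
review_scores_rating        1     1.71     1.71     8.78    0.00308
room_type                   3     68.2     22.7      117   7.47e-70
bedrooms                    1      143      143      735  1.34e-138
beds                        1   0.0843   0.0843    0.433       0.51
bathrooms_text             23     49.1     2.14       11   6.52e-38
accommodates                1     7.02     7.02     36.1   2.21e-09
neighbourhood_simplified    4     26.6     6.65     34.2   1.17e-27
Residuals                2055      400    0.195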
autoplot(model7)

From the model above, model 7, we see that the coefficients for the different property types are all statistically significant at the 1% level. Number of reviews, number of people accommodated, beds, and bedrooms are also statistically significant at the 1% level, as is the review score rating (p = 0.002). All three room type levels (hotel room, private room, and shared room) differ significantly from the base room type. The neighbourhoods Outside Lands and Southern are statistically significant when compared to the base case of Downtown, while North of Downtown and Western Addition are not. The residuals of the model also look random based on the autoplot of residuals against fitted values. The adjusted R-squared is high at 0.6281. However, the categorical variable bathrooms_text has a GVIF of 31.2 and room_type has a GVIF above 20, which points to multi-collinearity in the model. The sequential anova also indicates that beds adds little once the terms entered before it are accounted for (p = 0.51). Because of these features, we will not be using this model as a predictor of price for Airbnbs in San Francisco.
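For terms with more than one degree of freedom, car::vif() reports a generalized VIF together with GVIF^(1/(2*Df)), which rescales it by the number of coefficients in the term. As a minimal sketch (the vector names are ours, not part of the analysis), the last column of the output above can be reproduced from the first two:

```r
# GVIF and Df values copied from the car::vif(model7) output above
gvif <- c(room_type = 21.160546, bathrooms_text = 31.225103, bedrooms = 5.035674)
df   <- c(room_type = 3,         bathrooms_text = 23,        bedrooms = 1)

# Df-adjusted value, comparable across terms with different numbers of levels
gvif_adj <- gvif^(1 / (2 * df))
print(round(gvif_adj, 4))
```

For a single-coefficient term such as bedrooms, the adjusted value is simply the square root of the ordinary VIF.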

Now, we run a model 7.1 that is the same as model 7 but with the bathrooms_text variable removed as it has a high VIF.

#bathrooms_text has high VIF so remove it
model7_1 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + beds+ accommodates + neighbourhood_simplified, data=listings_filtered_log)
msummary(model7_1)
                                                       Estimate Std. Error
(Intercept)                                           6.0756189  0.1463890
prop_type_simplifiedEntire rental unit               -0.1277471  0.0465238
prop_type_simplifiedEntire residential home          -0.1379707  0.0468867
prop_type_simplifiedOther                            -0.2460108  0.0452837
prop_type_simplifiedPrivate room in residential home -0.4493925  0.0591778
number_of_reviews                                    -0.0007638  0.0000946
review_scores_rating                                  0.1007453  0.0283988
room_typeHotel room                                   0.3002980  0.0720501
room_typePrivate room                                -0.2483238  0.0339487
room_typeShared room                                 -1.2483880  0.1003097
bedrooms                                              0.3106995  0.0233514
beds                                                 -0.0526952  0.0146805
accommodates                                          0.0797897  0.0110429
neighbourhood_simplifiedNorth of Downtown            -0.0021650  0.0383518
neighbourhood_simplifiedOutside Lands                -0.3363906  0.0351930
neighbourhood_simplifiedSouthern                     -0.2581803  0.0302352
neighbourhood_simplifiedWestern Addition             -0.0871199  0.0490584
                                                     t value Pr(>|t|)    
(Intercept)                                           41.503  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.746 0.006088 ** 
prop_type_simplifiedEntire residential home           -2.943 0.003290 ** 
prop_type_simplifiedOther                             -5.433 6.20e-08 ***
prop_type_simplifiedPrivate room in residential home  -7.594 4.66e-14 ***
number_of_reviews                                     -8.073 1.15e-15 ***
review_scores_rating                                   3.548 0.000398 ***
room_typeHotel room                                    4.168 3.20e-05 ***
room_typePrivate room                                 -7.315 3.67e-13 ***
room_typeShared room                                 -12.445  < 2e-16 ***
bedrooms                                              13.305  < 2e-16 ***
beds                                                  -3.589 0.000339 ***
accommodates                                           7.225 6.98e-13 ***
neighbourhood_simplifiedNorth of Downtown             -0.056 0.954988    
neighbourhood_simplifiedOutside Lands                 -9.558  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -8.539  < 2e-16 ***
neighbourhood_simplifiedWestern Addition              -1.776 0.075905 .  

Residual standard error: 0.4596 on 2078 degrees of freedom
  (626 observations deleted due to missingness)
Multiple R-squared:  0.5995,    Adjusted R-squared:  0.5964 
F-statistic: 194.4 on 16 and 2078 DF,  p-value: < 2.2e-16
car::vif(model7_1)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     3.812113  4        1.182077
number_of_reviews        1.128927  1        1.062510
review_scores_rating     1.133487  1        1.064654
room_type                4.073428  3        1.263747
bedrooms                 4.084697  1        2.021063
beds                     4.057305  1        2.014275
accommodates             4.777204  1        2.185682
neighbourhood_simplified 1.592668  4        1.059902
anova(model7_1)
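                           Df   Sum Sq  Mean Sq  F value     Pr(>F)
prop_type_simplified        4      352     87.9      416   1.4e-263
number_of_reviews           1     48.5     48.5      230   2.69e-49
review_scores_rating        1     1.71     1.71     8.09     0.0045
room_type                   3     68.2     22.7      108   8.93e-65
bedrooms                    1      143      143      677  1.98e-129
beds                        1   0.0843   0.0843    0.399      0.528
accommodates                1     12.6     12.6     59.6   1.82e-14
neighbourhood_simplified    4     31.2      7.8     36.9   7.41e-30
Residuals                2078      439    0.211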
autoplot(model7_1)

In the sequential anova, beds still does not add significantly once the preceding terms are in the model (p = 0.528), probably because it is correlated with accommodates, which indicates how many people the listing can host. Adjusted R-squared is still quite high at 0.5964, and all GVIF values are now below 5, so the collinearity issue from model7 is resolved. We therefore run another regression without the beds variable, as shown below:

#beds is not significant in the anova and has a GVIF of about 4, so remove it; it is probably correlated with accommodates
#after removing bathrooms_text and beds, all variables are significant and the GVIF of accommodates drops from 4.8 to 3.4
model7_2 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + accommodates + neighbourhood_simplified, data=listings_filtered_log)
msummary(model7_2)
                                                       Estimate Std. Error
(Intercept)                                           6.091e+00  1.467e-01
prop_type_simplifiedEntire rental unit               -1.339e-01  4.664e-02
prop_type_simplifiedEntire residential home          -1.478e-01  4.697e-02
prop_type_simplifiedOther                            -2.539e-01  4.538e-02
prop_type_simplifiedPrivate room in residential home -4.625e-01  5.916e-02
number_of_reviews                                    -7.710e-04  9.467e-05
review_scores_rating                                  1.010e-01  2.847e-02
room_typeHotel room                                   2.993e-01  7.222e-02
room_typePrivate room                                -2.504e-01  3.398e-02
room_typeShared room                                 -1.406e+00  8.998e-02
bedrooms                                              2.866e-01  2.248e-02
accommodates                                          5.865e-02  9.297e-03
neighbourhood_simplifiedNorth of Downtown             1.218e-03  3.840e-02
neighbourhood_simplifiedOutside Lands                -3.339e-01  3.513e-02
neighbourhood_simplifiedSouthern                     -2.532e-01  3.022e-02
neighbourhood_simplifiedWestern Addition             -8.167e-02  4.916e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           41.527  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.871  0.00413 ** 
prop_type_simplifiedEntire residential home           -3.148  0.00167 ** 
prop_type_simplifiedOther                             -5.596 2.48e-08 ***
prop_type_simplifiedPrivate room in residential home  -7.818 8.47e-15 ***
number_of_reviews                                     -8.144 6.53e-16 ***
review_scores_rating                                   3.546  0.00040 ***
room_typeHotel room                                    4.144 3.54e-05 ***
room_typePrivate room                                 -7.370 2.45e-13 ***
room_typeShared room                                 -15.626  < 2e-16 ***
bedrooms                                              12.752  < 2e-16 ***
accommodates                                           6.308 3.43e-10 ***
neighbourhood_simplifiedNorth of Downtown              0.032  0.97470    
neighbourhood_simplifiedOutside Lands                 -9.504  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -8.380  < 2e-16 ***
neighbourhood_simplifiedWestern Addition              -1.661  0.09677 .  

Residual standard error: 0.4609 on 2087 degrees of freedom
  (618 observations deleted due to missingness)
Multiple R-squared:  0.5976,    Adjusted R-squared:  0.5947 
F-statistic: 206.6 on 15 and 2087 DF,  p-value: < 2.2e-16
car::vif(model7_2)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     3.802958  4        1.181722
number_of_reviews        1.125794  1        1.061035
review_scores_rating     1.133088  1        1.064466
room_type                3.233716  3        1.216047
bedrooms                 3.767270  1        1.940946
accommodates             3.373997  1        1.836844
neighbourhood_simplified 1.586121  4        1.059356
anova(model7_2)
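                           Df   Sum Sq  Mean Sq  F value     Pr(>F)
prop_type_simplified        4      356     89.1      419  2.75e-265
number_of_reviews           1     47.6     47.6      224   3.12e-48
review_scores_rating        1     1.68     1.68     7.91    0.00497
room_type                   3     68.6     22.9      108   7.86e-65
bedrooms                    1      143      143      675  3.42e-129
accommodates                1      9.9      9.9     46.6   1.13e-11
neighbourhood_simplified    4       31     7.74     36.4    1.8e-29
Residuals                2087      443    0.212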
autoplot(model7_2)

Now all variables are significant. Within neighbourhood_simplified, Outside Lands and Southern are significantly different from the base category Downtown, with negative coefficients, which means that these areas are usually cheaper; North of Downtown and Western Addition are not significantly different from Downtown at the 5% level. All GVIF values are lower than 5, indicating no serious multi-collinearity, and the high R-squared is not lost, as adjusted R-squared is 0.595. The residuals also look random based on the autoplot graphs. This is our best model so far.
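Because the response is log(price_4_nights), a coefficient b on a dummy variable corresponds to a multiplicative change of exp(b) in price, holding the other variables fixed. A minimal sketch using the model7_2 estimates printed above (the helper name pct_change is ours):

```r
# Coefficients copied from the msummary(model7_2) output above
coefs <- c(outside_lands = -0.3339, southern = -0.2532)

# Convert a log-scale coefficient into a percentage change in price
pct_change <- function(b) 100 * (exp(b) - 1)

print(round(pct_change(coefs), 1))
```

Read this way, listings in Outside Lands are roughly 28% cheaper than comparable Downtown listings, and those in the Southern neighbourhoods roughly 22% cheaper.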

4.8 Model 8

  1. What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables? We test these two extra variables by adding them to model7_2.
#availability_30 not significant and review is significant, let's see individual performance of the two
model8 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + accommodates + neighbourhood_simplified + reviews_per_month +availability_30, data=listings_filtered_log) 
msummary(model8)
                                                       Estimate Std. Error
(Intercept)                                           6.075e+00  1.489e-01
prop_type_simplifiedEntire rental unit               -1.341e-01  4.649e-02
prop_type_simplifiedEntire residential home          -1.455e-01  4.682e-02
prop_type_simplifiedOther                            -2.439e-01  4.537e-02
prop_type_simplifiedPrivate room in residential home -4.513e-01  5.905e-02
number_of_reviews                                    -7.156e-04  9.555e-05
review_scores_rating                                  1.054e-01  2.873e-02
room_typeHotel room                                   2.878e-01  7.208e-02
room_typePrivate room                                -2.624e-01  3.436e-02
room_typeShared room                                 -1.432e+00  9.098e-02
bedrooms                                              2.864e-01  2.241e-02
accommodates                                          5.794e-02  9.294e-03
neighbourhood_simplifiedNorth of Downtown             7.204e-03  3.831e-02
neighbourhood_simplifiedOutside Lands                -3.257e-01  3.513e-02
neighbourhood_simplifiedSouthern                     -2.517e-01  3.025e-02
neighbourhood_simplifiedWestern Addition             -8.201e-02  4.919e-02
reviews_per_month                                    -5.890e-03  1.519e-03
availability_30                                       8.638e-04  1.138e-03
                                                     t value Pr(>|t|)    
(Intercept)                                           40.806  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.885 0.003958 ** 
prop_type_simplifiedEntire residential home           -3.107 0.001916 ** 
prop_type_simplifiedOther                             -5.376 8.48e-08 ***
prop_type_simplifiedPrivate room in residential home  -7.644 3.20e-14 ***
number_of_reviews                                     -7.489 1.02e-13 ***
review_scores_rating                                   3.669 0.000250 ***
room_typeHotel room                                    3.992 6.77e-05 ***
room_typePrivate room                                 -7.635 3.41e-14 ***
room_typeShared room                                 -15.741  < 2e-16 ***
bedrooms                                              12.781  < 2e-16 ***
accommodates                                           6.234 5.49e-10 ***
neighbourhood_simplifiedNorth of Downtown              0.188 0.850840    
neighbourhood_simplifiedOutside Lands                 -9.271  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -8.323  < 2e-16 ***
neighbourhood_simplifiedWestern Addition              -1.667 0.095647 .  
reviews_per_month                                     -3.877 0.000109 ***
availability_30                                        0.759 0.447990    

Residual standard error: 0.4595 on 2085 degrees of freedom
  (618 observations deleted due to missingness)
Multiple R-squared:  0.6005,    Adjusted R-squared:  0.5973 
F-statistic: 184.4 on 17 and 2085 DF,  p-value: < 2.2e-16
car::vif(model8)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     3.860573  4        1.183945
number_of_reviews        1.153921  1        1.074207
review_scores_rating     1.161289  1        1.077631
room_type                3.414175  3        1.227103
bedrooms                 3.768971  1        1.941384
accommodates             3.393759  1        1.842216
neighbourhood_simplified 1.619448  4        1.062113
reviews_per_month        1.063491  1        1.031257
availability_30          1.195086  1        1.093200
anova(model8)

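                           Df   Sum Sq  Mean Sq  F value     Pr(>F)
prop_type_simplified        4      356     89.1      422  1.68e-266
number_of_reviews           1     47.6     47.6      226   1.65e-48
review_scores_rating        1     1.68     1.68     7.96    0.00483
room_type                   3     68.6     22.9      108   3.28e-65
bedrooms                    1      143      143      679  7.03e-130
accommodates                1      9.9      9.9     46.9   9.73e-12
neighbourhood_simplified    4       31     7.74     36.7   1.18e-29
reviews_per_month           1     3.09     3.09     14.7   0.000133
availability_30             1    0.122    0.122    0.576      0.448
Residuals                2085      440    0.211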
Keeping all the variables from model 7.2, we now add availability_30 and reviews_per_month to see if they affect the price beyond what is explained by the variables in model 7.2. Here, we see that availability_30 is not significant (p = 0.448), while reviews_per_month is significant (p = 0.0001). Therefore, we drop the availability_30 variable and run the model again.
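anova() on a single lm fit reports sequential (Type I) tests, so each row is tested after only the terms listed above it. For the last single-coefficient term, the F value equals the square of that coefficient's t value in the summary, which is why availability_30 has the same p-value in both outputs. A quick check with the numbers printed above (variable names are ours):

```r
# t value for availability_30 from msummary(model8); it is the last term in the formula
t_avail <- 0.759

# For the final 1-df term, the sequential F statistic is t^2
f_avail <- t_avail^2
print(round(f_avail, 3))

# Two-sided p-value from the t distribution on the residual degrees of freedom
p_avail <- 2 * pt(-abs(t_avail), df = 2085)
print(round(p_avail, 3))
```

Terms that are not last, such as beds in model 7, can show a large anova p-value alongside a small coefficient p-value, because the two tests condition on different sets of variables.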

#review is significant
model8_1 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + accommodates + neighbourhood_simplified + reviews_per_month , data=listings_filtered_log) 
msummary(model8_1)
                                                       Estimate Std. Error
(Intercept)                                           6.096e+00  1.462e-01
prop_type_simplifiedEntire rental unit               -1.338e-01  4.649e-02
prop_type_simplifiedEntire residential home          -1.457e-01  4.682e-02
prop_type_simplifiedOther                            -2.425e-01  4.533e-02
prop_type_simplifiedPrivate room in residential home -4.512e-01  5.904e-02
number_of_reviews                                    -7.143e-04  9.552e-05
review_scores_rating                                  1.020e-01  2.838e-02
room_typeHotel room                                   2.865e-01  7.206e-02
room_typePrivate room                                -2.583e-01  3.393e-02
room_typeShared room                                 -1.421e+00  8.977e-02
bedrooms                                              2.861e-01  2.240e-02
accommodates                                          5.848e-02  9.267e-03
neighbourhood_simplifiedNorth of Downtown             7.035e-03  3.830e-02
neighbourhood_simplifiedOutside Lands                -3.274e-01  3.506e-02
neighbourhood_simplifiedSouthern                     -2.538e-01  3.012e-02
neighbourhood_simplifiedWestern Addition             -8.520e-02  4.901e-02
reviews_per_month                                    -5.798e-03  1.514e-03
                                                     t value Pr(>|t|)    
(Intercept)                                           41.695  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.878 0.004048 ** 
prop_type_simplifiedEntire residential home           -3.113 0.001876 ** 
prop_type_simplifiedOther                             -5.349 9.80e-08 ***
prop_type_simplifiedPrivate room in residential home  -7.642 3.25e-14 ***
number_of_reviews                                     -7.477 1.11e-13 ***
review_scores_rating                                   3.595 0.000332 ***
room_typeHotel room                                    3.976 7.23e-05 ***
room_typePrivate room                                 -7.611 4.08e-14 ***
room_typeShared room                                 -15.829  < 2e-16 ***
bedrooms                                              12.770  < 2e-16 ***
accommodates                                           6.311 3.38e-10 ***
neighbourhood_simplifiedNorth of Downtown              0.184 0.854291    
neighbourhood_simplifiedOutside Lands                 -9.339  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -8.424  < 2e-16 ***
neighbourhood_simplifiedWestern Addition              -1.739 0.082236 .  
reviews_per_month                                     -3.829 0.000132 ***

Residual standard error: 0.4594 on 2086 degrees of freedom
  (618 observations deleted due to missingness)
Multiple R-squared:  0.6004,    Adjusted R-squared:  0.5973 
F-statistic: 195.9 on 16 and 2086 DF,  p-value: < 2.2e-16
car::vif(model8_1)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     3.836137  4        1.183005
number_of_reviews        1.153549  1        1.074034
review_scores_rating     1.133195  1        1.064516
room_type                3.250392  3        1.217090
bedrooms                 3.767418  1        1.940984
accommodates             3.374074  1        1.836865
neighbourhood_simplified 1.596601  4        1.060229
reviews_per_month        1.056740  1        1.027978
anova(model8_1)

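                           Df   Sum Sq  Mean Sq  F value     Pr(>F)
prop_type_simplified        4      356     89.1      422  1.42e-266
number_of_reviews           1     47.6     47.6      226   1.61e-48
review_scores_rating        1     1.68     1.68     7.96    0.00483
room_type                   3     68.6     22.9      108   3.17e-65
bedrooms                    1      143      143      680  6.55e-130
accommodates                1      9.9      9.9     46.9   9.68e-12
neighbourhood_simplified    4       31     7.74     36.7   1.16e-29
reviews_per_month           1     3.09     3.09     14.7   0.000132
Residuals                2086      440    0.211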
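All variables are significant (the North of Downtown and Western Addition levels remain statistically indistinguishable from Downtown at the 5% level), and adjusted R-squared is slightly higher at 0.597, compared with 0.595 for model7_2. This model tops model7_2.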
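Adjusted R-squared penalizes R-squared for the number of estimated coefficients, so adding reviews_per_month only raises it if the fit improves by more than the extra parameter costs. A minimal sketch reproducing the two reported values from R-squared, the sample size, and the number of slope coefficients (the function name adj_r2 is ours):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of slope coefficients
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual df + p + 1 = 2103 for both models (from the outputs above)
adj_7_2 <- adj_r2(r2 = 0.5976, n = 2103, p = 15)  # model7_2
adj_8_1 <- adj_r2(r2 = 0.6004, n = 2103, p = 16)  # model8_1
print(round(c(model7_2 = adj_7_2, model8_1 = adj_8_1), 4))
```

The gain from model7_2 to model8_1 is small, about 0.003, but it survives the penalty for the additional coefficient.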

4.9 Model 9

Here, we check whether availability_30 would be significant without the reviews_per_month variable.

#availability_30 still not, so only include review
model9 <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + accommodates + neighbourhood_simplified + availability_30 , data=listings_filtered_log) 
msummary(model9)
                                                       Estimate Std. Error
(Intercept)                                           6.079e+00  1.494e-01
prop_type_simplifiedEntire rental unit               -1.341e-01  4.665e-02
prop_type_simplifiedEntire residential home          -1.477e-01  4.698e-02
prop_type_simplifiedOther                            -2.549e-01  4.544e-02
prop_type_simplifiedPrivate room in residential home -4.627e-01  5.917e-02
number_of_reviews                                    -7.723e-04  9.474e-05
review_scores_rating                                  1.030e-01  2.882e-02
room_typeHotel room                                   3.001e-01  7.226e-02
room_typePrivate room                                -2.528e-01  3.439e-02
room_typeShared room                                 -1.412e+00  9.114e-02
bedrooms                                              2.868e-01  2.248e-02
accommodates                                          5.833e-02  9.325e-03
neighbourhood_simplifiedNorth of Downtown             1.264e-03  3.840e-02
neighbourhood_simplifiedOutside Lands                -3.330e-01  3.520e-02
neighbourhood_simplifiedSouthern                     -2.520e-01  3.035e-02
neighbourhood_simplifiedWestern Addition             -7.974e-02  4.935e-02
availability_30                                       5.122e-04  1.138e-03
                                                     t value Pr(>|t|)    
(Intercept)                                           40.695  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -2.875 0.004083 ** 
prop_type_simplifiedEntire residential home           -3.144 0.001688 ** 
prop_type_simplifiedOther                             -5.610 2.30e-08 ***
prop_type_simplifiedPrivate room in residential home  -7.819 8.35e-15 ***
number_of_reviews                                     -8.152 6.10e-16 ***
review_scores_rating                                   3.572 0.000362 ***
room_typeHotel room                                    4.154 3.40e-05 ***
room_typePrivate room                                 -7.351 2.82e-13 ***
room_typeShared room                                 -15.498  < 2e-16 ***
bedrooms                                              12.757  < 2e-16 ***
accommodates                                           6.255 4.80e-10 ***
neighbourhood_simplifiedNorth of Downtown              0.033 0.973754    
neighbourhood_simplifiedOutside Lands                 -9.459  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -8.305  < 2e-16 ***
neighbourhood_simplifiedWestern Addition              -1.616 0.106299    
availability_30                                        0.450 0.652807    

Residual standard error: 0.461 on 2086 degrees of freedom
  (618 observations deleted due to missingness)
Multiple R-squared:  0.5976,    Adjusted R-squared:  0.5946 
F-statistic: 193.6 on 16 and 2086 DF,  p-value: < 2.2e-16
car::vif(model9)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     3.830646  4        1.182794
number_of_reviews        1.126851  1        1.061532
review_scores_rating     1.160729  1        1.077371
room_type                3.390405  3        1.225675
bedrooms                 3.768890  1        1.941363
accommodates             3.393361  1        1.842108
neighbourhood_simplified 1.609629  4        1.061306
availability_30          1.187499  1        1.089724
anova(model9)

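                           Df   Sum Sq  Mean Sq  F value     Pr(>F)
prop_type_simplified        4      356     89.1      419  3.52e-265
number_of_reviews           1     47.6     47.6      224   3.26e-48
review_scores_rating        1     1.68     1.68     7.91    0.00497
room_type                   3     68.6     22.9      108   8.33e-65
bedrooms                    1      143      143      675  3.84e-129
accommodates                1      9.9      9.9     46.6   1.14e-11
neighbourhood_simplified    4       31     7.74     36.4   1.85e-29
availability_30             1    0.043    0.043    0.202      0.653
Residuals                2086      443    0.213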
availability_30 is still not significant (p = 0.653). Therefore, we conclude that model8_1 is our best model.

4.10 Summary Tables

After examining all models, model8_1 is the one with the most explanatory power among the models we retain, given its high R-squared and significant variables. Therefore, we will be using model8_1 as our main predictor model for the price of Airbnbs in San Francisco.

Summary Table

huxreg(model1,model2,model3,model4_1,model5,model6,model7_2,model8_1, 
       statistics = c("N" = "nobs",
                      "R2" = "r.squared",
                      "Adj.R2"="adj.r.squared",
                      "Residual SE"="sigma")) 

(1)(2)(3)(4)(5)(6)(7)(8)
(Intercept)6.812 ***6.755 ***5.755 ***6.066 ***6.096 ***6.133 ***6.091 ***6.096 ***
(0.157)   (0.157)   (0.160)   (0.151)   (0.152)   (0.151)   (0.147)   (0.146)   
prop_type_simplifiedEntire rental unit-0.177 ***-0.182 ***        -0.191 ***-0.193 ***-0.182 ***-0.134 ** -0.134 ** 
(0.053)   (0.051)           (0.048)   (0.048)   (0.047)   (0.047)   (0.046)   
prop_type_simplifiedEntire residential home0.095    0.092            -0.236 ***-0.239 ***-0.225 ***-0.148 ** -0.146 ** 
(0.052)   (0.050)           (0.047)   (0.047)   (0.047)   (0.047)   (0.047)   
prop_type_simplifiedOther-0.580 ***-0.492 ***        -0.314 ***-0.318 ***-0.296 ***-0.254 ***-0.242 ***
(0.045)   (0.048)           (0.046)   (0.046)   (0.046)   (0.045)   (0.045)   
prop_type_simplifiedPrivate room in residential home-1.023 ***-0.884 ***        -0.649 ***-0.661 ***-0.654 ***-0.462 ***-0.451 ***
(0.052)   (0.061)           (0.057)   (0.058)   (0.057)   (0.059)   (0.059)   
number_of_reviews-0.001 ***-0.001 ***        -0.001 ***-0.001 ***-0.001 ***-0.001 ***-0.001 ***
(0.000)   (0.000)           (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.078 *  0.089 **         0.082 ** 0.072 *  0.074 *  0.101 ***0.102 ***
(0.031)   (0.031)           (0.029)   (0.030)   (0.029)   (0.028)   (0.028)   
room_typeHotel room        0.294 ***        0.468 ***0.482 ***0.532 ***0.299 ***0.287 ***
        (0.076)           (0.072)   (0.073)   (0.073)   (0.072)   (0.072)   
room_typePrivate room        -0.146 ***        -0.145 ***-0.136 ***-0.118 ***-0.250 ***-0.258 ***
        (0.034)           (0.033)   (0.034)   (0.034)   (0.034)   (0.034)   
room_typeShared room        -1.190 ***        -1.046 ***-1.046 ***-1.052 ***-1.406 ***-1.421 ***
        (0.103)           (0.101)   (0.101)   (0.100)   (0.090)   (0.090)   
bedrooms                0.223 ***0.293 ***0.295 ***0.285 ***0.287 ***0.286 ***
                (0.030)   (0.024)   (0.024)   (0.024)   (0.022)   (0.022)   
beds                -0.088 ***-0.054 ***-0.055 ***-0.050 ***                
                (0.017)   (0.015)   (0.015)   (0.015)                   
bathrooms_text0 shared baths                -0.366                                            
                (0.335)                                           
bathrooms_text1 bath                0.508 **                                         
                (0.160)                                           
bathrooms_text1 private bath                0.356 *                                          
                (0.160)                                           
bathrooms_text1 shared bath                -0.059                                            
                (0.163)                                           
bathrooms_text1.5 baths                0.690 ***                                        
                (0.169)                                           
bathrooms_text1.5 shared baths                -0.022                                            
                (0.181)                                           
bathrooms_text10 baths                -0.099                                            
                (0.447)                                           
bathrooms_text10 shared baths                -0.844 ***                                        
                (0.216)                                           
bathrooms_text2 baths                0.756 ***                                        
                (0.165)                                           
bathrooms_text2 shared baths                0.024                                            
                (0.194)                                           
bathrooms_text2.5 baths                0.862 ***                                        
                (0.186)                                           
bathrooms_text2.5 shared baths                -0.163                                            
                (0.447)                                           
bathrooms_text3 baths                0.968 ***                                        
                (0.188)                                           
bathrooms_text3 shared baths                -0.122                                            
                (0.612)                                           
bathrooms_text3.5 baths                1.370 ***                                        
                (0.220)                                           
bathrooms_text4 baths                0.862 ***                                        
                (0.251)                                           
bathrooms_text4 shared baths                -0.248                                            
                (0.262)                                           
bathrooms_text4.5 baths                1.387 ***                                        
                (0.302)                                           
bathrooms_text5 baths                1.913 ***                                        
                (0.462)                                           
bathrooms_text5 shared baths                0.210                                            
                (0.245)                                           
bathrooms_text6 shared baths                0.765                                            
                (0.449)                                           
bathrooms_text6.5 shared baths                0.670                                            
                (0.613)                                           
bathrooms_textPrivate half-bath                -0.031                                            
                (0.376)                                           
bathrooms_textShared half-bath                -0.348                                            
                (0.447)                                           
accommodates                0.091 ***0.085 ***0.086 ***0.085 ***0.059 ***0.058 ***
                (0.014)   (0.011)   (0.011)   (0.011)   (0.009)   (0.009)   
factor(host_is_superhost)TRUE                                0.035                            
                                (0.023)                           
factor(instant_bookable)TRUE                                        -0.114 ***                
                                        (0.022)                   
neighbourhood_simplifiedNorth of Downtown                                                0.001    0.007    
                                                (0.038)   (0.038)   
neighbourhood_simplifiedOutside Lands                                                -0.334 ***-0.327 ***
                                                (0.035)   (0.035)   
neighbourhood_simplifiedSouthern                                                -0.253 ***-0.254 ***
                                                (0.030)   (0.030)   
neighbourhood_simplifiedWestern Addition                                                -0.082    -0.085    
                                                (0.049)   (0.049)   
reviews_per_month                                                        -0.006 ***
                                                        (0.002)   
N            2474     2474     2316     2095     2093     2095     2103     2103
R2           0.330    0.374    0.424    0.571    0.571    0.576    0.598    0.600
Adj. R2      0.328    0.371    0.417    0.569    0.569    0.574    0.595    0.597
Residual SE  0.572    0.553    0.591    0.475    0.475    0.472    0.461    0.459
*** p < 0.001; ** p < 0.01; * p < 0.05.
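From the summary table of all the models, we can see that model 8 has the highest R-squared (0.600) and adjusted R-squared (0.597), and the lowest residual standard error (0.459). Not every term is significant, however: the North of Downtown and Western Addition neighbourhood levels are not distinguishable from the Downtown baseline at the 5% level.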

Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.

Here we make a few assumptions for the prediction:

  • prop_type_simplified = Private room in residential home
  • reviews_per_month > 0
  • number_of_reviews >= 10
  • average rating of at least 90, which is 4.5 on the 0 to 5 scale of review_scores_rating in this dataset
  • neighbourhood_simplified = Downtown

We filter the data first:

listings_filtered_log_predict <- listings_filtered_log %>%
  filter(number_of_reviews >= 10, 
         reviews_per_month >0, 
         review_scores_rating >= 4.5)

#prediction1 <- predict(model8_1, newdata = listings_filtered_log_predict)
#rmse <- sqrt(mean((prediction1 - listings_filtered_log_predict$log_price4)^2, na.rm = TRUE))
#c(RMSE = rmse, R2 = summary(model8_1)$r.squared)

#par(mfrow = c(1, 1))
#plot(exp(listings_filtered_log_predict$log_price4), exp(prediction1), xlim = c(0, 1000), ylim = c(0, 1000))

Then we refit the model 8 specification on the filtered data:

model8_p <- lm(log_price4 ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bedrooms + accommodates + neighbourhood_simplified + reviews_per_month , data=listings_filtered_log_predict) 
msummary(model8_p)
                                                       Estimate Std. Error
(Intercept)                                           1.798e+00  4.277e-01
prop_type_simplifiedEntire rental unit               -7.773e-02  4.474e-02
prop_type_simplifiedEntire residential home          -7.637e-02  4.516e-02
prop_type_simplifiedOther                            -2.344e-01  4.448e-02
prop_type_simplifiedPrivate room in residential home -4.421e-01  5.673e-02
number_of_reviews                                    -4.291e-04  8.548e-05
review_scores_rating                                  9.714e-01  8.652e-02
room_typeHotel room                                   3.825e-01  1.345e-01
room_typePrivate room                                -2.211e-01  3.339e-02
room_typeShared room                                 -1.228e+00  7.529e-02
bedrooms                                              3.112e-01  2.108e-02
accommodates                                          3.241e-02  8.424e-03
neighbourhood_simplifiedNorth of Downtown             2.622e-02  3.850e-02
neighbourhood_simplifiedOutside Lands                -2.894e-01  3.470e-02
neighbourhood_simplifiedSouthern                     -2.117e-01  3.143e-02
neighbourhood_simplifiedWestern Addition             -9.264e-02  4.879e-02
reviews_per_month                                    -3.001e-03  1.246e-03
                                                     t value Pr(>|t|)    
(Intercept)                                            4.205 2.77e-05 ***
prop_type_simplifiedEntire rental unit                -1.737 0.082542 .  
prop_type_simplifiedEntire residential home           -1.691 0.091042 .  
prop_type_simplifiedOther                             -5.270 1.57e-07 ***
prop_type_simplifiedPrivate room in residential home  -7.794 1.22e-14 ***
number_of_reviews                                     -5.020 5.81e-07 ***
review_scores_rating                                  11.227  < 2e-16 ***
room_typeHotel room                                    2.843 0.004533 ** 
room_typePrivate room                                 -6.621 4.98e-11 ***
room_typeShared room                                 -16.315  < 2e-16 ***
bedrooms                                              14.762  < 2e-16 ***
accommodates                                           3.847 0.000125 ***
neighbourhood_simplifiedNorth of Downtown              0.681 0.495938    
neighbourhood_simplifiedOutside Lands                 -8.340  < 2e-16 ***
neighbourhood_simplifiedSouthern                      -6.737 2.32e-11 ***
neighbourhood_simplifiedWestern Addition              -1.899 0.057777 .  
reviews_per_month                                     -2.408 0.016153 *  

Residual standard error: 0.3641 on 1468 degrees of freedom
  (181 observations deleted due to missingness)
Multiple R-squared:  0.7081,    Adjusted R-squared:  0.7049 
F-statistic: 222.6 on 16 and 1468 DF,  p-value: < 2.2e-16
model8_p %>% broom::tidy()
term                                                   estimate   std.error  statistic  p.value
(Intercept)                                            1.8        0.428      4.2        2.77e-05
prop_type_simplifiedEntire rental unit                 -0.0777    0.0447     -1.74      0.0825
prop_type_simplifiedEntire residential home            -0.0764    0.0452     -1.69      0.091
prop_type_simplifiedOther                              -0.234     0.0445     -5.27      1.57e-07
prop_type_simplifiedPrivate room in residential home   -0.442     0.0567     -7.79      1.22e-14
number_of_reviews                                      -0.000429  8.55e-05   -5.02      5.81e-07
review_scores_rating                                   0.971      0.0865     11.2       4.05e-28
room_typeHotel room                                    0.382      0.135      2.84       0.00453
room_typePrivate room                                  -0.221     0.0334     -6.62      4.98e-11
room_typeShared room                                   -1.23      0.0753     -16.3      4.03e-55
bedrooms                                               0.311      0.0211     14.8       4.38e-46
accommodates                                           0.0324     0.00842    3.85       0.000125
neighbourhood_simplifiedNorth of Downtown              0.0262     0.0385     0.681      0.496
neighbourhood_simplifiedOutside Lands                  -0.289     0.0347     -8.34      1.7e-16
neighbourhood_simplifiedSouthern                       -0.212     0.0314     -6.74      2.32e-11
neighbourhood_simplifiedWestern Addition               -0.0926    0.0488     -1.9       0.0578
reviews_per_month                                      -0.003     0.00125    -2.41      0.0162
model8_p %>% broom::glance()
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC       BIC       deviance  df.residual  nobs
0.708      0.705          0.364  223        0        16  -598    1.23e+03  1.33e+03  195       1468         1485
#get mean of each continuous variable in model 8 
favstats(~number_of_reviews,data=listings_filtered_log_predict)
min   Q1    median  Q3    max  mean  sd     n     missing
10    29    69      163   861  114   117    1666  0
favstats(~review_scores_rating,data=listings_filtered_log_predict)
min   Q1    median  Q3    max  mean  sd     n     missing
4.5   4.8   4.89    4.95  5    4.86  0.118  1666  0
favstats(~bedrooms,data=listings_filtered_log_predict)
min   Q1    median  Q3    max  mean  sd     n     missing
1     1     1       2     7    1.54  0.855  1485  181
favstats(~accommodates,data=listings_filtered_log_predict)
min   Q1    median  Q3    max  mean  sd     n     missing
2     2     3       4     16   3.52  1.94   1666  0
favstats(~reviews_per_month,data=listings_filtered_log_predict)
min   Q1    median  Q3    max  mean  sd     n     missing
0.11  1.06  2.41    4.29  126  3.96  7.63   1666  0
Let’s assume that we stay in a Private room in residential home, with room type Private room, in the Downtown neighbourhood (the base level of neighbourhood_simplified). For the continuous predictors we take the mean number_of_reviews (114), the mean review_scores_rating (4.86), 2 bedrooms (the mean of 1.54 rounded up; the median is 1), the median accommodates (3), and the mean reviews_per_month (3.96).

We plug these values into our model to get the following: log_price4 = 1.798275 - 0.442116 + 114 × (-0.000429) + 4.86 × 0.971359 - 0.221062 + 2 × 0.311219 + 3 × 0.032410 + 3.96 × (-0.003001) ≈ 6.5148

Taking e^(6.5148) gives approximately $675. This is the predicted price for four nights in San Francisco downtown based on the assumptions above.

95% prediction interval (approximate): The sigma (residual standard error) of the model is 0.364. Taking log_price4 ± 1.96 × 0.364 gives [5.801, 7.228] on the log scale. Taking e to the power of these values yields roughly [331, 1378]. This approximation ignores the uncertainty in the estimated coefficients, so the exact prediction interval is slightly wider.

We are 95% confident that, given the above assumptions, the price of the Airbnb for four nights lies within the stated range of roughly 331 to 1378 dollars.
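
The hand calculation above can also be reproduced in a few lines of R. The following is a minimal sketch, assuming the coefficients exactly as printed in the msummary(model8_p) output (for example -3.001e-03 for reviews_per_month) rather than the fitted object itself; the vector names coefs and x are illustrative and do not appear in the original analysis.

```r
# Coefficients copied from the msummary(model8_p) output (not read from the model object).
coefs <- c(
  intercept         =  1.798275,
  prop_private_room = -0.442116,  # prop_type_simplified: Private room in residential home
  number_of_reviews = -0.000429,
  review_scores     =  0.971359,
  room_private      = -0.221062,  # room_type: Private room
  bedrooms          =  0.311219,
  accommodates      =  0.032410,
  reviews_per_month = -0.003001
)

# Assumed listing, in the same order as coefs.
# Dummy variables are set to 1; Downtown is the base level, so it has no term.
x <- c(1, 1, 114, 4.86, 1, 2, 3, 3.96)

# Point prediction on the log scale.
log_pred <- sum(coefs * x)

# Approximate 95% prediction interval: +/- 1.96 residual standard errors.
sigma <- 0.364
log_interval <- log_pred + c(-1, 1) * 1.96 * sigma

# Back-transform to dollars for four nights.
point_usd <- exp(log_pred)
interval_usd <- exp(log_interval)
round(c(fit = point_usd, lwr = interval_usd[1], upr = interval_usd[2]))
```

If the fitted object is available, predict(model8_p, newdata = ..., interval = "prediction") should give the exact interval on the log scale, which also accounts for the uncertainty in the coefficients.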

4.11 Predictive Test

We further run a predictive test on the model, with the data split into training and testing.

library(rsample)
set.seed(1234)
train_split <- initial_split(listings_filtered_log, prop = 0.7)
price_train <- training(train_split)
price_test <- testing(train_split)

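# Refit the model 8 specification on the training split so the test set is truly out of sample
model8_train <- update(model8_1, data = price_train)

# RMSE on the log_price4 scale; rows with a missing prediction or price are ignored
rmse_train <- price_train %>%
  mutate(predictions = predict(model8_train, newdata = price_train)) %>%
  summarise(rmse = sqrt(mean((predictions - log_price4)^2, na.rm = TRUE))) %>%
  pull(rmse)
rmse_train

rmse_test <- price_test %>%
  mutate(predictions = predict(model8_train, newdata = price_test)) %>%
  summarise(rmse = sqrt(mean((predictions - log_price4)^2, na.rm = TRUE))) %>%
  pull(rmse)
rmse_test

RMSE, or Root Mean Squared Error, measures the typical size of the prediction error, here on the log_price4 scale, so it can be compared directly with the residual standard error of the fitted model. A test RMSE close to the training RMSE indicates that the model generalises to listings it has not seen, while a much larger test RMSE would point to overfitting.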
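
As a self-contained illustration of the train/test RMSE check, the sketch below uses simulated data in place of the Airbnb listings. The data frame sim, its coefficients, and the helper rmse() are hypothetical and only stand in for the real variables; the model is fit on the training rows only, so the test RMSE is a genuine out-of-sample measure.

```r
# Root mean squared error, ignoring pairs where either value is missing.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Simulated stand-in data (hypothetical; not the Airbnb listings).
set.seed(1234)
n <- 500
sim <- data.frame(
  bedrooms     = sample(1:4, n, replace = TRUE),
  accommodates = sample(2:8, n, replace = TRUE)
)
sim$log_price4 <- 5 + 0.3 * sim$bedrooms + 0.05 * sim$accommodates +
  rnorm(n, sd = 0.4)

# 70/30 split; the model only ever sees the training rows.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]
fit <- lm(log_price4 ~ bedrooms + accommodates, data = train)

# Compare in-sample and out-of-sample error on the log scale.
rmse_train <- rmse(train$log_price4, predict(fit, newdata = train))
rmse_test  <- rmse(test$log_price4,  predict(fit, newdata = test))
c(train = rmse_train, test = rmse_test)
```

Because the simulated noise has a standard deviation of 0.4, both values should land near 0.4; a test RMSE far above the training RMSE would be the warning sign.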

pred <- predict(model8_1, listings_filtered_log, interval = "confidence")
predict_and_original_data <- cbind(listings_filtered_log, pred)
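
The call above requests confidence intervals, which describe the uncertainty in the mean fitted value. For the price of a single listing, the prediction interval is usually the relevant one. The sketch below shows the difference on R's built-in cars dataset, which is used purely as a stand-in; the object names are illustrative.

```r
# Stand-in model on R's built-in cars dataset (not the Airbnb data).
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)

# Confidence interval: uncertainty about the mean response at speed = 15.
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation at speed = 15.
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

# Both share the same point estimate; the prediction interval is wider
# because it adds the residual variance to the variance of the fitted mean.
conf_width <- unname(conf_int[, "upr"] - conf_int[, "lwr"])
pred_width <- unname(pred_int[, "upr"] - pred_int[, "lwr"])
c(confidence = conf_width, prediction = pred_width)
```

In the Airbnb setting the same two calls would be made with model8_1 and a one-row data frame describing the listing, with the bounds exponentiated afterwards to return to dollars.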

To further improve the analysis, we might consider factors around hosts, such as how long they have been hosting, and the more specific review ratings, such as cleanliness and location. Interestingly, the model 8 specification fits the filtered dataset (listings with at least 10 reviews and a rating of at least 4.5) noticeably better: refit as model8_p, it reaches an adjusted R-squared of 0.705, compared with about 0.60 on the full data. This suggests that listings with few reviews or low ratings add considerable noise, and that removing them increases the explanatory power of the model.

5 Acknowledgements

Deliverables

  • By midnight on Monday 18 Oct 2021, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown to introduce, frame, and describe your story and findings. You should include the following in the memo:
  1. Executive Summary: Based on your best model, indicate the factors that influence price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.
  2. Data Exploration and Feature Selection: Present key elements of the data, including tables and graphs that help the reader understand the important variables in the dataset. Describe how the data was cleaned and prepared, including feature selection, transformations, interactions, and other approaches you considered.
  3. Model Selection and Validation: Describe the model fitting and validation process used. State the model you selected and why they are preferable to other choices.
  4. Findings and Recommendations: Interpret the results of the selected model and discuss additional steps that might improve the analysis
